Job Summary
This role is essential for orchestrating large-scale machine learning training runs and optimizing distributed infrastructure.
The ideal candidate will have a deep understanding of GPU clusters and extensive systems engineering experience to keep training runs efficient and reliable.
Responsibilities include automating checkpointing and failure recovery during month-long training runs.
Skills & Requirements
Must-have
Distributed training across 1,000+ GPUs
Deep expertise in 3D parallelism
Experience with PyTorch, DeepSpeed, and Megatron-LM
Optimizing InfiniBand RDMA networking
Managing SLURM or Kubernetes clusters
Nice-to-have
Strong systems engineering background
Experience with C++ and CUDA
Automated checkpointing strategies
Key Requirements
Deep expertise in 3D parallelism (Data, Tensor, Pipeline)
Experience managing SLURM or Kubernetes-based GPU clusters
Strong systems engineering background (C++, CUDA, Python)