Deep expertise in 3d parallelism data tensor pipeline
Experience managing slurm or kubernetes-based gpu clusters
Strong systems engineering background c++ cuda python
This role is essential for orchestrating large-scale machine learning training runs and optimizing distributed infrastructure
Job Summary
This role is essential for orchestrating large-scale machine learning training runs and optimizing distributed infrastructure.
The ideal candidate will have a deep understanding of GPU clusters and extensive experience in system engineering to ensure efficient and reliable training processes.
Responsibilities include automating checkpointing and failure recovery during month-long training runs.
Matching Summary
This role is essential for orchestrating large-scale machine learning training runs and optimizing distributed infrastructure.
Skills & Requirements
Must-have
Deep expertise in 3D parallelism Data Tensor Pipeline
Experience managing SLURM or Kubernetes-based GPU clusters
Strong systems engineering background C++ CUDA Python
Orchestrate distributed training runs across 1,000+ GPUs
Optimize networking InfiniBand RDMA and memory management
Nice-to-have
Extensive experience in system engineering
Deep understanding of GPU clusters
Ensuring efficient and reliable training processes
Key Requirements
Deep expertise in 3D parallelism
Experience managing SLURM or Kubernetes-based GPU clusters
Strong systems engineering background including C++, CUDA, Python