Hyphen Partners is looking for an experienced LLM Pre-training & Distributed Systems Engineer to optimize large-scale machine learning training runs and distributed infrastructure. The ideal candidate will possess expertise in GPU clusters, parallelism, and systems engineering, particularly using technologies like PyTorch and Kubernetes.
Job Summary
This role is essential for orchestrating large-scale machine learning training runs across massive GPU clusters.
The ideal candidate will have a deep understanding of GPU clusters and extensive experience in systems engineering.
Responsibilities include optimizing networking and memory management to prevent out-of-memory errors during month-long training runs.
Skills & Requirements
Must-have
Deep expertise in 3D parallelism strategies
Experience managing SLURM or Kubernetes clusters
Strong systems engineering background with C++
CUDA and Python programming proficiency
Experience optimizing networking via InfiniBand and RDMA
Nice-to-have
Experience with PyTorch, DeepSpeed, and Megatron-LM
Knowledge of automated checkpointing systems
Ability to handle failure recovery processes
Key Requirements
Deep expertise in data, tensor, and pipeline parallelism