Hyphen Partners is looking for an experienced LLM Pre-training & Distributed Systems Engineer to manage large-scale machine learning training runs and optimize distributed infrastructure in China. The ideal candidate will possess deep expertise in GPU clusters and systems engineering, ensuring efficient training processes.
Job Summary
This role is essential for orchestrating large-scale machine learning training runs across massive GPU clusters.
The ideal candidate will have a deep understanding of GPU clusters and extensive experience in systems engineering.
Responsibilities include optimizing networking and memory management to prevent out-of-memory errors during month-long training runs.
Skills & Requirements
Must-have
Distributed training on 1000+ GPUs
Deep expertise in 3D parallelism
Experience with PyTorch, DeepSpeed, and Megatron-LM
Optimization of InfiniBand RDMA networking
Strong systems engineering background
Nice-to-have
Automated checkpointing strategies
Failure recovery for long runs
Memory management optimization skills
Key Requirements
Deep expertise in 3D parallelism (Data, Tensor, Pipeline)
Experience managing SLURM or Kubernetes-based GPU clusters
Strong systems engineering background (C++, CUDA, Python)