Hyphen Partners is looking for a skilled LLM Pre-training & Distributed Systems Engineer to optimize large-scale machine learning training runs and distributed infrastructure in Singapore. The ideal candidate should possess deep expertise in GPU clusters and systems engineering, with a strong focus on automating and managing extensive training processes.
Job Summary
This role is essential for orchestrating large-scale machine learning training runs across 1,000+ GPUs using PyTorch, DeepSpeed, or Megatron-LM.
The ideal candidate will have a deep understanding of GPU clusters and extensive systems engineering experience to keep training processes efficient and reliable.
Responsibilities include optimizing networking and memory management while automating checkpointing and failure recovery during month-long training runs.
Skills & Requirements
Must-have
Deep expertise in 3D parallelism (data, tensor, and pipeline)
Experience managing SLURM or Kubernetes-based GPU clusters
Strong systems engineering background (C++, CUDA, Python)
Nice-to-have
Optimization of networking (InfiniBand, RDMA)
Memory management to prevent out-of-memory errors
Automation of checkpointing and failure recovery processes