Strong systems engineering background with C++, CUDA, and Python
Hyphen Partners is looking for an LLM Pre-training & Distributed Systems Engineer to manage large-scale machine learning training runs and optimize distributed infrastructure. The ideal candidate should possess extensive experience with GPU clusters and system engineering, particularly in orchestrating distributed training processes.
Job Summary
This role is essential for orchestrating large-scale machine learning training runs across extensive GPU clusters.
The ideal candidate will have a deep understanding of GPU clusters and extensive experience in system engineering.
Responsibilities include optimizing networking and memory management to ensure reliable month-long training processes.
Matching Summary
Match Score: 85
Skills & Requirements
Must-have
Deep expertise in 3D parallelism (data, tensor, and pipeline)
Experience managing SLURM or Kubernetes clusters
Strong systems engineering background with C++, CUDA, and Python
Nice-to-have
Optimizing InfiniBand and RDMA networking performance
Automating checkpointing and failure recovery processes
Preventing out-of-memory errors during training
Key Requirements
Deep expertise in data, tensor, and pipeline parallelism
Experience managing SLURM or Kubernetes-based GPU clusters
Strong systems engineering background, including C++, CUDA, and Python