Hyphen Connect is seeking an LLM Pre-training & Distributed Systems Engineer to manage and optimize large-scale machine learning training runs using GPU clusters. The ideal candidate will have strong expertise in systems engineering, particularly with parallelism and managing distributed systems.
Job Summary
This role is essential for orchestrating large-scale machine learning training runs and optimizing distributed infrastructure.
The ideal candidate will have a deep understanding of GPU clusters and extensive systems engineering experience to ensure efficient and reliable training processes.
Responsibilities include automating checkpointing and failure recovery during month-long training runs.
Skills & Requirements
Must-have
Distributed training across 1,000+ GPUs
Deep expertise in 3D parallelism (data, tensor, and pipeline)
Experience with PyTorch, DeepSpeed, and Megatron-LM
Optimization of InfiniBand RDMA networking
Strong systems engineering background
Nice-to-have
Automated checkpointing strategies
Failure recovery during long runs
Memory management optimization skills
Key Requirements
Deep expertise in 3D parallelism
Experience managing SLURM or Kubernetes-based GPU clusters
Strong systems engineering background in C++, CUDA, and Python