LLM Pre-training & Distributed Engineer (AI Infrastructure)

Hyphen Connect

Singapore, Singapore
  • Distributed training across 1,000+ GPUs
  • Deep expertise in 3D parallelism
  • Experience with PyTorch, DeepSpeed, and Megatron-LM

Hyphen Connect is seeking an LLM Pre-training & Distributed Systems Engineer to manage and optimize large-scale machine learning training runs using GPU clusters. The ideal candidate will have strong expertise in systems engineering, particularly with parallelism and managing distributed systems.

Job Summary

  • This role is essential for orchestrating large-scale machine learning training runs and optimizing distributed infrastructure.
  • The ideal candidate will have a deep understanding of GPU clusters and extensive systems engineering experience to ensure efficient and reliable training processes.
  • Responsibilities include automating checkpointing and failure recovery during month-long training runs.



Skills & Requirements

Must-have

  • Distributed training across 1,000+ GPUs
  • Deep expertise in 3D parallelism
  • Experience with PyTorch, DeepSpeed, and Megatron-LM
  • Optimization of InfiniBand RDMA networking
  • Strong systems engineering background

Nice-to-have

  • Automated checkpointing strategies
  • Failure recovery during long runs
  • Memory management optimization skills

Key Requirements

  • Deep expertise in 3D parallelism
  • Experience managing SLURM or Kubernetes-based GPU clusters
  • Strong systems engineering background in C++, CUDA, and Python

Work Rights

Not specified
