LLM Pre-training & Distributed Engineer (AI Infrastructure)

Hyphen Partners

China
On-site
Distributed training on 1000+ GPUs
Deep expertise in 3D parallelism
Experience with PyTorch, DeepSpeed, and Megatron-LM
Hyphen Partners is looking for an experienced LLM Pre-training & Distributed Systems Engineer to manage large-scale machine learning training runs and optimize distributed infrastructure in China. The ideal candidate will have deep expertise in GPU clusters and systems engineering to keep training runs efficient.

Job Summary

  • This role is essential for orchestrating large-scale machine learning training runs across massive GPU clusters.
  • The ideal candidate will have a deep understanding of GPU clusters and extensive experience in systems engineering.
  • Responsibilities include optimizing networking and memory management to prevent out-of-memory errors during month-long training runs.

Skills & Requirements

Must-have

  • Distributed training on 1000+ GPUs
  • Deep expertise in 3D parallelism
  • Experience with PyTorch, DeepSpeed, and Megatron-LM
  • Optimization of InfiniBand RDMA networking
  • Strong systems engineering background

Nice-to-have

  • Automated checkpointing strategies
  • Failure recovery for long runs
  • Memory management optimization skills

Key Requirements

  • Deep expertise in 3D parallelism (Data, Tensor, Pipeline)
  • Experience managing SLURM or Kubernetes-based GPU clusters
  • Strong systems engineering background (C++, CUDA, Python)

Work Rights

Not specified
