LLM Pre-training & Distributed Engineer (AI Infrastructure)

Hyphen Connect

Hong Kong
On-site
Distributed training across 1,000+ GPUs
Deep expertise in 3D parallelism
Experience with PyTorch, DeepSpeed, and Megatron-LM
Hyphen Connect is looking for an experienced LLM Pre-training & Distributed Systems Engineer to manage large-scale machine learning training operations and optimize distributed infrastructure. The ideal candidate should possess deep expertise in GPU clusters and systems engineering, particularly with tools like PyTorch and Kubernetes.

Job Summary

  • This role is essential for orchestrating large-scale machine learning training runs and optimizing distributed infrastructure.
  • The ideal candidate will have a deep understanding of GPU clusters and extensive experience in systems engineering to ensure efficient and reliable training processes.
  • Responsibilities include automating checkpointing and failure recovery during month-long training runs.

Skills & Requirements

Must-have

  • Distributed training across 1,000+ GPUs
  • Deep expertise in 3D parallelism
  • Experience with PyTorch, DeepSpeed, and Megatron-LM
  • Optimizing InfiniBand RDMA networking
  • Managing SLURM or Kubernetes GPU clusters

Nice-to-have

  • Strong systems engineering background
  • Proficiency in C++, CUDA, and Python
  • Automating checkpointing and failure recovery

Key Requirements

  • Deep expertise in 3D parallelism (data, tensor, and pipeline)
  • Experience managing SLURM or Kubernetes-based GPU clusters
  • Strong systems engineering background (C++, CUDA, Python)

Work Rights

Not specified
