LLM Pre-training & Distributed Engineer (AI Infrastructure)

Hyphen Partners

Seattle, United States
On-site
Distributed training across 1,000+ GPUs
Deep expertise in 3D parallelism
Experience with PyTorch, DeepSpeed, and Megatron-LM

Job Summary

  • This role is essential for orchestrating large-scale machine learning training runs and optimizing distributed infrastructure.
  • The ideal candidate will have a deep understanding of GPU clusters and extensive systems engineering experience to keep training runs efficient and reliable.
  • Responsibilities include automating checkpointing and failure recovery during month-long training runs.

Skills & Requirements

Must-have

  • Distributed training across 1,000+ GPUs
  • Deep expertise in 3D parallelism
  • Experience with PyTorch, DeepSpeed, and Megatron-LM
  • Optimizing InfiniBand RDMA networking
  • Managing SLURM or Kubernetes clusters

Nice-to-have

  • Strong systems engineering background
  • Experience with C++ and CUDA
  • Automated checkpointing strategies

Key Requirements

  • Deep expertise in 3D parallelism (Data, Tensor, Pipeline)
  • Experience managing SLURM or Kubernetes-based GPU clusters
  • Strong systems engineering background (C++, CUDA, Python)

Work Rights

Not specified
