LLM Pre-training & Distributed Engineer (AI Infrastructure)

Hyphen Connect

Australia, Australia
On-site
Deep expertise in 3D parallelism (data, tensor, pipeline)
Experience managing SLURM or Kubernetes-based GPU clusters
Strong systems engineering background (C++, CUDA, Python)
Hyphen Connect is seeking a skilled LLM Pre-training & Distributed Systems Engineer to manage large-scale machine learning training operations and optimize distributed infrastructure. The ideal candidate has expertise in GPU clusters and systems engineering, particularly in orchestrating training runs and keeping them efficient.

Job Summary

  • This role is essential for orchestrating large-scale machine learning training runs across 1,000+ GPUs using PyTorch, DeepSpeed, or Megatron-LM.
  • The ideal candidate will have a deep understanding of GPU clusters and extensive systems engineering experience to keep training processes efficient and reliable.
  • Responsibilities include optimizing networking and memory management while automating checkpointing and failure recovery during month-long training runs.

Skills & Requirements

Must-have

  • Deep expertise in 3D parallelism (data, tensor, pipeline)
  • Experience managing SLURM or Kubernetes-based GPU clusters
  • Strong systems engineering background (C++, CUDA, Python)

Nice-to-have

  • Optimizing networking (InfiniBand, RDMA) and memory management
  • Automating checkpointing and failure recovery processes
  • Preventing out-of-memory errors during training

Key Requirements

  • Deep expertise in 3D parallelism strategies
  • Experience with SLURM or Kubernetes GPU clusters
  • Strong background in C++, CUDA, and Python

Work Rights

Not specified
