LLM Pre-training & Distributed Engineer (AI Infrastructure)

Hyphen Connect

Seattle, United States
On-site

Job Summary

  • This role is essential for orchestrating large-scale machine learning training runs and optimizing distributed infrastructure.
  • The ideal candidate will have a deep understanding of GPU clusters and extensive systems engineering experience to keep training runs efficient and reliable.
  • Responsibilities include automating checkpointing and failure recovery during month-long training runs.

Skills & Requirements

Must-have

  • Deep expertise in 3D parallelism (data, tensor, pipeline)
  • Experience managing SLURM or Kubernetes-based GPU clusters
  • Strong systems engineering background (C++, CUDA, Python)
  • Orchestrate distributed training runs across 1,000+ GPUs
  • Optimize networking (InfiniBand, RDMA) and memory management

Nice-to-have

  • Extensive experience in systems engineering
  • Deep understanding of GPU clusters
  • Ensuring efficient and reliable training processes

Key Requirements

  • Deep expertise in 3D parallelism
  • Experience managing SLURM or Kubernetes-based GPU clusters
  • Strong systems engineering background including C++, CUDA, Python

Work Rights

Not specified
