Senior AI Infrastructure Engineer - Training Platform

engineers.ai

San Francisco, CA, US
Base: $216,000 - $270,000 USD; Equity: subject to ...
On-site
5+ years backend or infrastructure engineering experience
2+ years orchestrating ML workloads at scale
Expert-level Kubernetes internals knowledge
The role involves architecting a high-performance training platform that serves as the operating system for massive GPU clusters handling thousands of GPUs.

Job Summary

  • The role involves architecting a high-performance training platform that serves as the operating system for massive GPU clusters handling thousands of GPUs.
  • Candidates will partner closely with researchers to build a resilient environment that transforms raw compute into breakthrough AI models.
  • Compensation includes a base salary range of $216,000 to $270,000 USD along with equity, comprehensive health benefits, and a learning stipend.

Salary

Base: $216,000 - $270,000 USD; Equity: Subject to Board approval; Benefits: Comprehensive health, dental, vision, retirement, learning stipend, PTO

Skills & Requirements

Must-have

  • 5+ years backend or infrastructure engineering experience
  • 2+ years orchestrating ML workloads at scale
  • Expert-level Kubernetes internals knowledge
  • Experience with distributed storage systems like Lustre or S3
  • Strong programming skills in Python, Go, Rust, or C++

Nice-to-have

  • Experience with DeepSpeed or FSDP distributed training techniques
  • Familiarity with the NVIDIA CUDA and NCCL software stack
  • Knowledge of reinforcement learning algorithms like GRPO
  • Experience with the PyTorch framework
  • Background in topology-aware scheduling and EFA/InfiniBand

Key Requirements

  • 5+ years of backend or infrastructure engineering experience
  • At least 2 years focused on orchestrating ML workloads at scale (100+ GPU nodes)
  • Expert-level knowledge of Kubernetes internals including Custom Resources and Operators

Work Rights

Not specified
