Senior AI Infrastructure Engineer - Training Platform

Scale

San Francisco, CA, US
Base: $216,000 - $270,000 USD; Equity: subject to ...
On-site
5+ years backend or infrastructure engineering experience
2+ years orchestrating ML workloads at scale
Expert-level Kubernetes internals knowledge
You will architect a high-performance training platform that handles the immense complexity of multi-thousand GPU workloads to ensure efficient compute usage

Job Summary

  • You will architect a high-performance training platform that handles the immense complexity of multi-thousand GPU workloads to ensure efficient compute usage.
  • The role involves partnering closely with researchers to build a seamless environment that transforms raw compute into breakthrough AI models.
  • Compensation includes a base salary range of $216,000 to $270,000 USD along with equity, comprehensive health benefits, and a learning stipend.

Matching Summary

You will architect a high-performance training platform that handles the immense complexity of multi-thousand GPU workloads to ensure efficient compute usage.

Salary

Base: $216,000 - $270,000 USD; Equity: Subject to Board approval; Benefits: Comprehensive health, dental, vision, retirement, PTO, and learning stipend

Skills & Requirements

Must-have

  • 5+ years backend or infrastructure engineering experience
  • 2+ years orchestrating ML workloads at scale
  • Expert-level Kubernetes internals knowledge
  • Experience with distributed storage systems like Lustre
  • Strong programming skills in Python, Go, Rust, or C++

Nice-to-have

  • Experience with DeepSpeed or FSDP distributed training
  • Familiarity with NVIDIA CUDA and NCCL software stack
  • Knowledge of reinforcement learning algorithms such as GRPO
  • Experience with PyTorch framework
  • Background in capacity planning with Finance teams

Key Requirements

  • 5+ years of backend or infrastructure engineering experience
  • At least 2 years focused on orchestrating ML workloads at scale
  • Expert-level knowledge of Kubernetes internals and device plugins

Work Rights

Not specified
