Senior AI Infrastructure Engineer - Training Platform

Scale

San Francisco, CA, US
On-site

Job Summary

  • You will architect a high-performance training platform that handles the immense complexity of multi-thousand GPU workloads.
  • The role involves partnering closely with researchers to build a seamless, resilient environment that transforms raw compute into breakthrough AI.
  • Compensation includes a base salary of $216,000 to $270,000 USD, plus equity and comprehensive benefits.

Salary

  • Base: $216,000 - $270,000 USD
  • Bonus/Equity: Equity grant subject to Board approval
  • Benefits: Comprehensive health, dental, vision, retirement, learning stipend, PTO

Skills & Requirements

Must-have

  • 5+ years of backend or infrastructure engineering experience
  • 2+ years orchestrating ML workloads at scale
  • Expert-level Kubernetes internals knowledge
  • Experience with distributed storage systems such as Lustre
  • Strong programming skills in Python, Go, Rust, or C++

Nice-to-have

  • Experience with distributed training using DeepSpeed or FSDP
  • Familiarity with the NVIDIA CUDA and NCCL software stack
  • Knowledge of reinforcement learning algorithms such as GRPO
  • Experience with the PyTorch framework
  • Background in topology-aware scheduling

Work Rights

Not specified
