Senior AI Infrastructure Engineer - Training Platform

spacedefense.ai

San Francisco, CA, US
Base: $216,000 - $270,000 USD; Bonus/Equity: equit...
On-site
5+ years backend or infrastructure engineering experience
2+ years orchestrating ML workloads at scale
Expert-level knowledge of Kubernetes internals

Job Summary

  • You will architect a high-performance training platform that handles the immense complexity of multi-thousand GPU workloads.
  • The role involves designing scheduling primitives to optimize the lifecycle of training jobs while ensuring high utilization.
  • Compensation includes a base salary of $216,000 to $270,000 USD, plus equity and comprehensive benefits.

Salary

  • Base: $216,000 - $270,000 USD
  • Bonus/Equity: Equity grant subject to Board approval
  • Benefits: Comprehensive health, dental, vision, retirement, learning stipend, PTO

Skills & Requirements

Must-have

  • 5+ years backend or infrastructure engineering experience
  • 2+ years orchestrating ML workloads at scale
  • Expert-level knowledge of Kubernetes internals
  • Experience with distributed storage systems such as Lustre or S3
  • Strong programming skills in Python, Go, Rust, or C++

Nice-to-have

  • Experience with distributed training techniques such as DeepSpeed or FSDP
  • Familiarity with the NVIDIA software stack, including CUDA and NCCL
  • Knowledge of reinforcement learning algorithms such as GRPO
  • Experience with the PyTorch framework
  • Background in post-training algorithms

Key Requirements

  • 5+ years of experience in backend or infrastructure engineering
  • At least 2 years focused on orchestrating ML workloads at scale
  • Expert-level knowledge of Kubernetes internals and device plugins

Work Rights

Not specified
