Experience with gpu cluster architecture and reliability
The role focuses on scaling research infrastructure to support massive, frontier-scale training runs across pre-training, mid-training, and post-training workstreams
Job Summary
The role focuses on scaling research infrastructure to support massive, frontier-scale training runs across pre-training, mid-training, and post-training workstreams.
You will act as a force multiplier who brings clarity to ambiguity, drives decisions when the path forward is unclear, and ensures cross-team coordination.
The company offers top-tier compensation including salary and equity, comprehensive health benefits, fully paid parental leave, and daily provided meals.
Matching Summary
The role focuses on scaling research infrastructure to support massive, frontier-scale training runs across pre-training, mid-training, and post-training workstreams.
Salary
Top-tier compensation; Salary and equity structured to recognize talent; Not specified
Skills & Requirements
Must-have
7+ years technical program management experience
Deep knowledge of distributed training frameworks
Experience with GPU cluster architecture and reliability
Ability to triage active incidents and drive resolution
Proven track record in high-ambiguity environments
Nice-to-have
First-responder mentality during system failures
Championing blameless post-mortem culture
Building processes from zero to one
Translating technical complexity for executives
Experience with Megatron or similar libraries
Key Requirements
7+ years of experience in technical program management