Principal Machine Learning Infrastructure Engineer
physicsbirds.dev
London, United Kingdom
On-site
Job Summary
The Principal ML Infrastructure Engineer will extend and operate the infrastructure that powers our research model training, fine-tuning, and serving pipelines.
You will own the research infrastructure end to end, with the autonomy to make architectural decisions and the responsibility to keep data flowing reliably.
We operate with a flat structure: good ideas win, wherever they come from.
Skills & Requirements
Must-have
Design and operate distributed training infrastructure
Optimize training pipelines for throughput
Build and maintain experiment tracking
Solve data loading bottlenecks
Build serving infrastructure for LPMs
Improve developer experience for the Research team
Deep expertise in distributed training
Strong systems fundamentals
Production experience with Kubernetes and SLURM
Proficiency in Python and ML frameworks
Nice-to-have
Experience with geometric deep learning
Background in HPC for simulation engineering
Experience building model serving infrastructure
Familiarity with experiment tracking tools
Experience packaging models for deployment
Key Requirements
5+ years of experience building and operating ML infrastructure