Principal Machine Learning Infrastructure Engineer
PhysicsX
London, United Kingdom
On-site
Job Summary
The Principal ML Infrastructure Engineer will extend and operate the infrastructure that powers our research model training, fine-tuning, and serving pipelines.
You will own the research infrastructure end to end, with the autonomy to make architectural decisions and the responsibility to keep data flowing reliably.
We offer equity options, a 10% employer pension contribution, free office lunches, enhanced parental leave, and private medical insurance.
Skills & Requirements
Must-have
Distributed training infrastructure
NVIDIA DGX B200 platform
Experiment tracking and observability
Data loading bottlenecks
Model serving infrastructure
Reproducible model checkpoints
Developer experience improvement
Nice-to-have
Geometric deep learning
HPC for simulation engineering
Latency and throughput requirements
Experiment tracking tools
Model packaging for deployment
Key Requirements
5+ years of experience building and operating ML infrastructure