Principal Machine Learning Infrastructure Engineer

PhysicsX

London, United Kingdom
On-site

Job Summary

  • The Principal ML Infrastructure Engineer will extend and operate the infrastructure that powers our research model training, fine-tuning, and serving pipelines.
  • You will own the research infrastructure end to end, with the autonomy to make architectural decisions and the responsibility to keep data flowing reliably.
  • We offer equity options, a 10% employer pension contribution, free office lunches, enhanced parental leave, and private medical insurance.

Skills & Requirements

Must-have

  • Distributed training infrastructure
  • NVIDIA DGX B200 platform
  • Experiment tracking and observability
  • Data loading bottlenecks
  • Model serving infrastructure
  • Reproducible model checkpoints
  • Developer experience improvement

Nice-to-have

  • Geometric deep learning
  • HPC for simulation engineering
  • Latency and throughput requirements
  • Experiment tracking tools
  • Model packaging for deployment

Key Requirements

  • 5+ years of experience building and operating ML infrastructure
  • Deep expertise in distributed training
  • Strong systems fundamentals
  • Production experience with Kubernetes and SLURM
  • Proficiency in Python and PyTorch
  • Experience with cloud GPU infrastructure

Work Rights

Not specified
