Senior Software Engineer, ML Platform (Stability & Infrastructure)
Isomorphic Labs
London, United Kingdom
On-site
Large-scale AI/ML production workloads
Google Cloud Platform (GCP) expertise
Kubernetes (GKE) orchestration
Job Summary
You will play a pivotal role in ensuring the reliability and scalability of the foundations that make this possible.
You will own the end-to-end strategy for platform reliability, with a specific focus on our accelerator (GPU/TPU) infrastructure and workload orchestration.
You will overhaul our logging and monitoring systems to provide radical visibility.
Skills & Requirements
Must-have
large-scale AI/ML production workloads
Google Cloud Platform (GCP) expertise
Kubernetes (GKE) orchestration
NVIDIA GPU generations
reliability-first software development
Nice-to-have
ML Software Engineering and Infrastructure SRE
leading multi-disciplinary projects
workload scheduling and ML efficiency research
Google TPU generations
Key Requirements
Proven experience in architecting and managing large-scale AI/ML workloads
Expertise in cloud compute design within Google Cloud Platform (GCP)
Significant experience deploying and managing complex workloads within Kubernetes (GKE)
Professional familiarity with NVIDIA GPU generations