Senior Software Engineer, ML Platform (Stability & Infrastructure)

Isomorphic Labs

London, United Kingdom
On-site

Job Summary

  • You will play a pivotal role in ensuring the reliability and scalability of the foundations that make this possible.
  • You will own the end-to-end strategy for platform reliability, with a specific focus on our accelerator (GPU/TPU) infrastructure and workload orchestration.
  • You will overhaul our logging and monitoring systems to provide radical visibility.

Skills & Requirements

Must-have

  • Large-scale AI/ML production workloads
  • Google Cloud Platform (GCP) expertise
  • Kubernetes (GKE) orchestration
  • NVIDIA GPU generations
  • Reliability-first software development

Nice-to-have

  • ML Software Engineering and Infrastructure SRE
  • Leading multi-disciplinary projects
  • Workload scheduling and ML efficiency research
  • Google TPU generations

Key Requirements

  • Proven experience in architecting and managing large-scale AI/ML workloads
  • Expertise in cloud compute design within Google Cloud Platform (GCP)
  • Significant experience deploying and managing complex workloads within Kubernetes (GKE)
  • Professional familiarity with NVIDIA GPU generations
  • Strong programming skills

Work Rights

Not specified
