Site Reliability Engineer (SRE) - AI Platform & Cloud

Morgan Stanley UK

  • 5 years of production experience in SRE
  • Deep Kubernetes and containerization expertise
  • Infrastructure-as-code with Terraform or Helm

Morgan Stanley is seeking a Site Reliability Engineer (SRE) for its AI Platform team to develop and maintain software solutions that support AI/ML systems in a regulated financial environment. The role requires significant experience in operations, automation, and systems engineering, with a strong emphasis on collaboration across multiple teams.

Job Summary

  • This role is part of a firmwide initiative to build an Artificial Intelligence Development Platform that drives efficiency, security, and innovation across the organization.
  • The successful candidate will collaborate with infrastructure, cloud, data, and security teams to ensure the availability and reliability of production AI workloads in a high-stakes financial environment.
  • Morgan Stanley offers a diverse, inclusive culture where employees are empowered to achieve their full potential alongside some of the best and brightest in the industry.

Matching Summary

Match Score: 75

Skills & Requirements

Must-have

  • 5 years of production experience in SRE
  • Deep Kubernetes and containerization expertise
  • Infrastructure-as-code with Terraform or Helm
  • Python, Go, or Java programming skills
  • Monitoring tools like Prometheus and Grafana
  • Experience with GPU clusters and AI workloads

Nice-to-have

  • Generative AI development and fine-tuning experience
  • Knowledge of ModelOps and LLM Ops practices
  • Experience with chaos engineering and canary deployments
  • Background in regulated financial environments
  • Proficiency with OpenTelemetry and distributed tracing

Key Requirements

  • Bachelor's or Master's degree in Computer Science
  • 5 years of production SRE or Infrastructure experience
  • Strong background in large-scale system operations

Work Rights

Not specified
