Site Reliability Engineer (SRE) - AI Platform & Cloud

Morgan Stanley

Morgan Stanley is seeking a Site Reliability Engineer (SRE) for their AI Platform team, focusing on developing and maintaining infrastructure for AI/ML systems in a high-stakes financial environment. The ideal candidate should have extensive experience with cloud technologies, container orchestration, and a passion for leveraging AI in their work.

Job Summary

  • This role focuses on building and maintaining the infrastructure that powers AI/ML systems within a global financial services firm.
  • The successful candidate will collaborate with cross-functional teams to ensure availability, reliability, and security of production AI workloads.
  • Morgan Stanley offers a diverse, inclusive environment with comprehensive benefits and opportunities for career growth across 1,200 global offices.

Matching Summary

Match Score: 75

Salary

Not specified. The listing describes comprehensive employee benefits and attractive perks.

Skills & Requirements

Must-have

  • 5 years of production SRE experience
  • Kubernetes and container orchestration
  • Infrastructure-as-code with Terraform or Helm
  • Python, Go, or Java scripting skills
  • Monitoring tools like Prometheus and Grafana
  • Incident response and root cause analysis
  • GPU cluster and distributed architecture knowledge

Nice-to-have

  • Generative AI and LLM experience
  • Experience in regulated financial environments
  • Chaos engineering and canary deployment skills
  • Strong communication and team collaboration
  • Knowledge of ModelOps and MLOps practices
  • High-performance computing scheduling expertise

Key Requirements

  • Bachelor's or Master's degree in Computer Science
  • 5 years of production experience in SRE or Infrastructure
  • Deep experience with Docker and Kubernetes
  • Solid experience in capacity planning and scaling
  • Demonstrated ability to lead RCAs and drive reliability improvements

Work Rights

Not specified