Site Reliability Engineer (SRE) - AI Platform & Cloud
Morgan Stanley
Morgan Stanley is seeking a Site Reliability Engineer (SRE) for their AI Platform team, focusing on developing and maintaining infrastructure for AI/ML systems in a high-stakes financial environment. The ideal candidate should have extensive experience with cloud technologies, container orchestration, and a passion for leveraging AI in their work.
Job Summary
This role focuses on building and maintaining the infrastructure that powers AI/ML systems within a global financial services firm.
The successful candidate will collaborate with cross-functional teams to ensure availability, reliability, and security of production AI workloads.
Morgan Stanley offers a diverse, inclusive environment with comprehensive benefits and opportunities for career growth across 1,200 global offices.
Matching Summary
Match Score: 75
Salary
Not specified. Comprehensive employee benefits and perks are described as attractive.
Skills & Requirements
Must-have
5 years production SRE experience
Kubernetes and container orchestration
Infrastructure-as-code with Terraform or Helm
Python, Go, or Java scripting skills
Monitoring tools like Prometheus and Grafana
Incident response and root cause analysis
GPU cluster and distributed architecture knowledge
Nice-to-have
Generative AI and LLM model experience
Experience in regulated financial environments
Chaos engineering and canary deployment skills
Strong communication and team collaboration
Knowledge of ModelOps and MLOps practices
High-performance computing scheduling expertise
Key Requirements
Bachelor's or Master's degree in Computer Science
5 years of production experience in SRE or Infrastructure
Deep experience with Docker and Kubernetes
Solid experience in capacity planning and scaling
Demonstrated ability to lead RCAs and drive reliability improvements