Job Summary
Mistral AI is seeking experienced Site Reliability Engineers to shape the reliability, scalability, and performance of its AI platform.
The role involves balancing daily production operations with long-term software engineering improvements to reduce operational toil across HPC clusters.
Candidates will collaborate with AI researchers to develop safe, reproducible model-training experiments and ensure infrastructure adheres to security best practices.
Skills & Requirements
Must-have
7+ years of DevOps or SRE experience
Kubernetes, Docker, and containerization
Terraform or CloudFormation infrastructure as code
Prometheus, Grafana, ELK Stack, or Datadog monitoring
Python, Go, or Bash scripting proficiency
CI/CD pipeline implementation and maintenance
Root cause analysis and incident response
Nice-to-have
Experience with AI/ML environments
High-performance computing (HPC) systems knowledge
Familiarity with the Slurm workload manager
Experience with Fluidstack, CoreWeave, or Vast
Contribution to open-source projects
Low-ego and team-spirited culture fit
Cloud-agnostic platform abstraction layer design
Key Requirements
Master's degree in Computer Science or Engineering
7+ years of experience in a DevOps/SRE role
Strong understanding of networking and system administration