Site Reliability Engineer - NYC

Mistral AI

New York, NY, United States
On-site
7+ years DevOps or SRE experience
Kubernetes, Docker, and containerization
Terraform or CloudFormation infrastructure as code

Job Summary

  • Mistral AI is seeking experienced Site Reliability Engineers to shape the reliability, scalability, and performance of their cutting-edge AI platform.
  • The role involves balancing daily production operations with long-term software engineering improvements to reduce operational toil across HPC clusters.
  • Candidates will collaborate with AI researchers to develop safe, reproducible model-training experiments and ensure infrastructure adheres to security best practices.

Skills & Requirements

Must-have

  • 7+ years DevOps or SRE experience
  • Kubernetes, Docker, and containerization
  • Terraform or CloudFormation infrastructure as code
  • Prometheus, Grafana, ELK Stack, or Datadog monitoring
  • Python, Go, or Bash scripting proficiency
  • CI/CD pipeline implementation and maintenance
  • Root cause analysis and incident response

Nice-to-have

  • Experience with AI/ML environments
  • High-performance computing (HPC) systems knowledge
  • Familiarity with the Slurm workload manager
  • Experience with Fluidstack, CoreWeave, or Vast
  • Contribution to open-source projects
  • Low-ego and team-spirited culture fit
  • Cloud-agnostic platform abstraction layer design

Key Requirements

  • Master's degree in Computer Science or Engineering
  • 7+ years of experience in a DevOps/SRE role
  • Strong understanding of networking and system administration

Work Rights

Not specified
