Site Reliability Engineer - NYC

Mistral AI

New York, NY, United States
On-site

Job Summary

  • The role involves balancing daily production operations with long-term engineering improvements to reduce toil and enhance system reliability.
  • Candidates will collaborate closely with software engineers and AI researchers to build a cloud-agnostic platform supporting model training and inference.
  • The team is described as dynamic, collaborative, and committed to democratizing AI through high-performance, open-source solutions.

Skills & Requirements

Must-have

  • 7+ years DevOps or SRE experience
  • Kubernetes, Docker, and orchestration tools
  • Infrastructure as code with Terraform
  • Prometheus, Grafana, or Datadog monitoring
  • Python, Go, or Bash scripting proficiency
  • CI/CD pipeline implementation and maintenance
  • Root cause analysis in production environments

Nice-to-have

  • Experience with AI/ML research environments
  • High-performance computing (HPC) systems knowledge
  • Familiarity with the Slurm workload manager
  • Background with Fluidstack, CoreWeave, or Vast
  • Contribution to open-source projects
  • Strong communication skills in a fast-paced startup culture

Key Requirements

  • Master's degree in Computer Science or Engineering
  • 7+ years of experience in DevOps/SRE roles
  • Experience with reliability KPIs for critical environments

Work Rights

Not specified
