Job Summary
The role involves balancing daily production operations with long-term engineering improvements to reduce toil and enhance system reliability.
Candidates will collaborate closely with software engineers and AI researchers to build a cloud-agnostic platform supporting model training and inference.
The team is described as dynamic, collaborative, and committed to democratizing AI through high-performance, open-source solutions.
Skills & Requirements
Must-have
7+ years DevOps or SRE experience
Kubernetes, Docker, and related orchestration tools
Infrastructure as code with Terraform
Prometheus, Grafana, or Datadog monitoring
Python, Go, or Bash scripting proficiency
CI/CD pipeline implementation and maintenance
Root cause analysis in production environments
Nice-to-have
Experience with AI/ML research environments
High-performance computing (HPC) systems knowledge
Familiarity with the Slurm workload manager
Background with Fluidstack, CoreWeave, or Vast
Contribution to open-source projects
Strong communication skills in a fast-paced startup culture
Key Requirements
Master's degree in Computer Science or Engineering
7+ years of experience in DevOps/SRE roles
Experience with reliability KPIs for critical environments