Cloud Site Reliability Engineer

SambaNova Systems

San Jose, United States
On-site

Job Summary

  • This role serves as the guardian of reliability, performance, and scalability for the company's generative AI inferencing service.
  • The team uses a balanced on-call rotation and focuses on prevention through automation to minimize alert fatigue.
  • Candidates will work with cutting-edge technology including the SN40L chip and SambaFlow software to push the boundaries of AI computing.

Salary

Competitive compensation; Equity included; Excellent benefits

Skills & Requirements

Must-have

  • 3-5+ years SRE or DevOps experience
  • Python, Go, or Java programming skills
  • Docker and Kubernetes orchestration
  • Prometheus, Grafana, or Datadog monitoring
  • Terraform or CloudFormation Infrastructure as Code
  • AWS, GCP, or Azure public cloud environment

Nice-to-have

  • Hybrid cloud and on-premises infrastructure experience
  • Production ML/AI inferencing service support
  • NVIDIA GPU workload optimization
  • vLLM, SGLang, or Ray model serving frameworks
  • MLOps principles and practices
  • Redis or Memcached caching systems
  • Linux/Unix system administration fundamentals

Key Requirements

  • Bachelor's degree in Computer Science or related field
  • 3-5+ years of relevant SRE or DevOps experience
  • Strong problem-solving skills for distributed systems

Work Rights

Not specified
