Senior Site Reliability Engineer - HPC

NVIDIA Corporation

Bengaluru, India
On-site
End-to-end SRE solutions
Infrastructure-as-Code (IaC)
Multi-cloud hybrid environment
NVIDIA is seeking a Senior Site Reliability Engineer (SRE) to join its Compute Farm team in Bengaluru, India. The ideal candidate will be responsible for ensuring the reliability and performance of critical systems, drawing on expertise in Kubernetes, Infrastructure as Code, and cloud environments.

Job Summary

  • Own SRE solutions end‑to‑end, from design and implementation to operation and continuous improvement, ensuring they integrate cleanly with HPC schedulers, storage, and network fabrics.
  • Deliver solutions in a globally distributed, multi‑cloud hybrid environment: on‑prem, AWS, GCP, and OCI.
  • NVIDIA offers highly competitive salaries and a comprehensive benefits package.

Skills & Requirements

Must-have

  • end-to-end SRE solutions
  • Infrastructure-as-Code (IaC)
  • multi-cloud hybrid environment
  • Kubernetes cluster design and support
  • CI/CD techniques
  • observability and AIOps

Nice-to-have

  • HPC cluster support (Slurm/LSF)
  • open source contributions
  • technical writing and speaking

Key Requirements

  • B.S. degree or equivalent experience
  • Over 4 years of experience building and supporting critical services
  • 4+ years of coding/scripting experience
  • Experience mentoring other engineers

Work Rights

Not specified
