NVIDIA is seeking a Senior Site Reliability Engineer (SRE) to join its Compute Farm team in Bengaluru, India. The ideal candidate will be responsible for ensuring the reliability and performance of critical systems, drawing on expertise in Kubernetes, Infrastructure as Code, and cloud environments.
Job Summary
Own SRE solutions end‑to‑end, from design and implementation to operation and continuous improvement, ensuring they integrate cleanly with HPC schedulers, storage, and network fabrics.
Deliver solutions in a globally distributed, multi‑cloud hybrid environment spanning on‑prem, AWS, GCP, and OCI.
NVIDIA offers highly competitive salaries and a comprehensive benefits package.
Matching Summary
Match Score: 85
Skills & Requirements
Must-have
end-to-end SRE solutions
Infrastructure-as-Code (IaC)
multi-cloud hybrid environment
Kubernetes cluster design and support
CI/CD techniques
observability and AIOps
Nice-to-have
HPC cluster support (Slurm/LSF)
open source contributions
technical writing and speaking
Key Requirements
B.S. degree or equivalent experience
Over 4 years of experience building and supporting critical services