Senior Site Reliability Engineer - HPC

NVIDIA

Bengaluru, India
Hybrid
Kubernetes cluster design and support
Infrastructure as Code (IaC)
CI/CD techniques

Job Summary

  • Own SRE solutions end-to-end, from design and implementation to operation and continuous improvement, ensuring they integrate cleanly with HPC schedulers, storage, and network fabrics.
  • Deliver solutions in a globally distributed, multi-cloud hybrid environment (on-premises, AWS, GCP, and OCI), designing for failure with redundancy, failure domains, progressive delivery, and strict change control.
  • NVIDIA offers highly competitive salaries and a comprehensive benefits package, fosters a diverse work environment, and is proud to be an equal opportunity employer.

Skills & Requirements

Must-have

  • Kubernetes cluster design and support
  • Infrastructure as Code (IaC)
  • CI/CD techniques
  • Multi-cloud hybrid environments
  • Monitoring, metrics, and container management
  • Coding/scripting in Python, Go, Perl, or Ruby

Nice-to-have

  • Applying AI to groundbreaking solutions
  • Creative problem-solving
  • Strong communication and documentation skills

Key Requirements

  • 4+ years building and supporting critical services
  • B.S. degree in Computer Science or equivalent experience
  • Experience with large-scale multi-tenant Kubernetes
  • Experience building Kubernetes controllers
  • Experience with automated host lifecycle management

Work Rights

Not specified
