Job Summary
Own SRE solutions end-to-end, from design and implementation to operation and continuous improvement, ensuring they integrate cleanly with HPC schedulers, storage, and network fabrics.
Deliver solutions in a globally distributed, multi-cloud hybrid environment (on-prem, AWS, GCP, and OCI), designing for failure with redundancy, failure domains, progressive delivery, and strict change control.
NVIDIA offers highly competitive salaries and a comprehensive benefits package. NVIDIA fosters a diverse work environment and is proud to be an equal opportunity employer.
Skills & Requirements
Must-have
Kubernetes cluster design and support
Infrastructure as Code (IaC)
CI/CD techniques
Multi-cloud hybrid environments
Monitoring, metrics, and container management
Coding/scripting in Python, Go, Perl, or Ruby
Nice-to-have
AI for groundbreaking solutions
Creative problem solving
Strong communication and documentation skills
Key Requirements
4+ years building and supporting critical services
B.S. degree in Computer Science or equivalent experience
Experience with large-scale multi-tenant Kubernetes
Experience building Kubernetes controllers
Experience with automated host lifecycle management