Manager, Site Reliability Engineer - DGX Cloud

NVIDIA

Job Summary

  • NVIDIA is building the future of AI and high-performance computing with cloud platforms at the core of this transformation.
  • As a Senior Manager of SRE, you will lead a talented team to build robust systems, automate operations, and drive continuous improvement.
  • NVIDIA offers highly competitive salaries and a comprehensive benefits package for employees and their families.

Skills & Requirements

Must-have

  • Site Reliability Engineering leadership
  • Kubernetes administration and containerization
  • Cloud environment operations (AWS, GCP, Azure)
  • Infrastructure automation tools
  • SRE principles and incident management
  • Programming in Python or Go
  • Observability platforms implementation

Nice-to-have

  • Mentorship and team development
  • Collaboration with engineering and product teams
  • Data-driven operational improvements
  • Blameless post-mortem culture
  • Security standards and compliance
  • Excellent communication skills

Key Requirements

  • Bachelor's or Master's degree in a related technical field
  • 10+ years of experience in SRE or DevOps
  • 5+ years of leadership or management experience
  • Experience with large-scale distributed cloud systems
  • Knowledge of Linux, networking, and cloud security standards

Work Rights

Not specified
