Job Summary
You will lead the team of site reliability engineers responsible for automating maintenance of 10,000+ hosts and supporting customers in debugging their workflows.
Use AI techniques and data analytics to extract useful signals about machines and jobs, ensuring high availability and resiliency of the systems in the data center.
NVIDIA is widely considered to be one of the technology world’s most desirable employers, with some of the most brilliant and talented people in the world working for us.
Salary
Base: 208,000 USD - 333,500 USD; Bonus/Equity: Eligible for equity; Benefits: Eligible for benefits
Skills & Requirements
Must-have
Python programming and scripting
Large scale cloud infrastructure
Service level agreement maintenance
Debugging and problem solving
Agile process and methodologies
People management experience
Nice-to-have
Data center management experience
Computer algorithms expertise
Cross-time zone collaboration
Continuous improvement focus
Designing simple reliable systems
Key Requirements
8+ years industry experience
2+ years people management experience
BS/MS in Computer Science or equivalent experience