Senior Platform and EngOps Engineer - Cluster Operations

NVIDIA

US, CA, United States

Job Summary

  • NVIDIA is leading groundbreaking developments in Artificial Intelligence, High-Performance Computing and Visualization.
  • The role involves developing automated tools to deploy and maintain GPU clusters interconnected via NVLink and InfiniBand, and managing cluster software and firmware updates.
  • Employees are eligible for a competitive base salary, equity, and benefits, with a commitment to fostering a diverse and inclusive work environment.

Salary

Base: 176,000 USD - 276,000 USD for Level 4, 208,000 USD - 333,500 USD for Level 5; Bonus/Equity: Eligible for equity; Benefits: Eligible for benefits

Skills & Requirements

Must-have

  • Automated GPU cluster deployment
  • DevOps automation tools
  • Cluster failure troubleshooting
  • Linux fundamentals proficiency
  • Ansible, Python and Shell scripting
  • High-performance computing infrastructure

Nice-to-have

  • Knowledge of resource scheduling managers
  • Industry-standard alerting tools
  • GPU-focused hardware experience
  • Metrics collection and alerting
  • Large-scale networking design

Key Requirements

  • BS or MS in Computer Science or related field
  • 8+ years of cluster deployment experience
  • Proven cross-team collaboration skills

Work Rights

Not specified
