Senior Platform And Engops Engineer - Cluster Operations

Nvidia Corporation

Base: $176,000 - $333,500 usd depending on level; ...
8+ years cluster administration experience
Ansible python shell scripting automation
Linux fundamentals and os knowledge
NVIDIA is seeking highly motivated EngOps engineers to manage and maintain extensive GPU clusters interconnected via NVLink and InfiniBand

Job Summary

  • NVIDIA is seeking highly motivated EngOps engineers to manage and maintain extensive GPU clusters interconnected via NVLink and InfiniBand.
  • The role involves developing automated tools for deployment, provisioning, and maintenance while taking ownership of daily cluster failures.
  • Candidates will collaborate with dynamic Engineering teams across multiple time zones to align operations with evolving project requirements.

Matching Summary

NVIDIA is seeking highly motivated EngOps engineers to manage and maintain extensive GPU clusters interconnected via NVLink and InfiniBand.

Salary

Base: $176,000 - $333,500 USD depending on level; Equity: Eligible; Benefits: Eligible

Skills & Requirements

Must-have

  • 8+ years cluster administration experience
  • Ansible Python Shell Scripting automation
  • Linux fundamentals and OS knowledge
  • High-performance computing network understanding

Nice-to-have

  • Slurm resource scheduling manager familiarity
  • DGX systems and GPU hardware experience
  • Robust metrics collection infrastructure design
  • Emergency response practices knowledge

Key Requirements

  • BS or MS in Computer Science or related field
  • 8+ years hands-on experience in cluster deployment
  • Proficiency in Linux fundamentals

Work Rights

Not specified

Tailored Resume

Cover Letter