Senior System Architect, Infrastructure Reliability

NVIDIA

NVIDIA is hiring a Senior System Architect for Infrastructure Reliability to develop an automated framework that identifies the root causes of job failures in real time across heterogeneous computing nodes. The ideal candidate will have extensive experience in distributed systems, CPU architecture, and programming, particularly in C++ and Python.

Salary

  • Base: $184,000 - $287,500 (Level 4) or $224,000 - $356,500 (Level 5)
  • Bonus/Equity: Eligible for equity
  • Benefits: Comprehensive benefits package included

Skills & Requirements

Must-have

  • Distributed systems mastery with 6+ years of experience
  • Expert knowledge of x86/ARM node-level metrics
  • Strong C++ and Python programming proficiency
  • Experience building automated RCA pipelines
  • Familiarity with Slurm or Kubernetes cluster managers

Nice-to-have

  • Deep knowledge of Linux kernel error-reporting interfaces
  • Experience with NVIDIA DCGM and NVML tools
  • Knowledge of checkpoint/restore technologies like CRIU
  • Machine learning model development for failure classification
  • Experience implementing low-overhead tracing mechanisms

Key Requirements

  • BS, MS, or PhD in Computer Science or Electrical Engineering
  • 6+ years in systems programming
  • Level 4 or Level 5 experience designation

Work Rights

Not specified
