Senior System Architect, Infrastructure Reliability

NVIDIA

Multiple Locations
Hybrid

Job Summary

  • NVIDIA is seeking a Senior System Architect to build a scalable failure attribution framework that captures high-fidelity state across CPU, GPU, and fabric at failure moments.
  • The role involves developing automated diagnostics that correlate hardware and system-level errors, and implementing low-overhead distributed logging and tracing across multi-node clusters.
  • Candidates are eligible for equity and benefits; the base salary range depends on level and location. Applications are accepted until March 1, 2026.

Salary

  • Base (Level 4): 184,000 USD - 287,500 USD
  • Base (Level 5): 224,000 USD - 356,500 USD
  • Bonus/Equity: Eligible for equity
  • Benefits: Eligible for benefits

Skills & Requirements

Must-have

  • Automated root cause analysis pipelines
  • High-performance C++ and Python programming
  • Cluster resource manager experience
  • Distributed systems programming
  • Failure attribution framework architecture
  • Real-time telemetry ingestion

Nice-to-have

  • Linux kernel error-reporting expertise
  • GPU infrastructure monitoring tools
  • Checkpoint/restore technology experience
  • Machine learning heuristics for failure classification

Key Requirements

  • 6+ years of systems programming experience
  • BS, MS, or PhD in Computer Science, Electrical Engineering, or equivalent
  • Experience with HPC or cloud-scale RCA pipelines
  • Expert knowledge of x86/ARM node-level metrics
  • Familiarity with Slurm, LSF, or Kubernetes
  • Work authorization: not specified

Work Rights

Not specified