NVIDIA Corporation is seeking a Senior System Architect to develop automated frameworks for failure attribution in heterogeneous EDA systems, focusing on performance optimization and root cause analysis. The ideal candidate will have extensive experience in distributed systems, CPU architecture, and programming, particularly with C++ and Python.
Job Summary
NVIDIA is seeking a Senior System Architect to develop an automated framework for failure attribution at scale in accelerated computing.
The role involves architecting frameworks that capture high-fidelity state across CPU, GPU, and Fabric at the moment of failure.
Candidates will work closely with hardware and infrastructure teams to define signals of impending failure so that proactive measures can be taken.
Matching Summary
Match Score: 85
Salary
Base: 184,000 USD - 287,500 USD; Bonus/Equity: Not specified; Benefits: Not specified
Skills & Requirements
Must-have
Experience building automated root cause analysis pipelines
Expert knowledge of CPU architecture
Strong C++ and Python programming skills
Experience with cluster resource managers
Nice-to-have
Expert knowledge of Linux kernel
Experience with NVIDIA DCGM and NVML
Familiarity with checkpoint/restore technologies
Key Requirements
6+ years of experience in systems programming
BS, MS, or PhD in Computer Science or Electrical Engineering