Senior Datacenter Resiliency Architect

NVIDIA

Fully remote

Job Summary

  • Architect hardware and software Resiliency features to improve system Reliability, Availability, Serviceability (RAS), and performance in the Datacenter.
  • Develop CUDA software diagnostics kernels to run on clusters of NVIDIA GPUs and identify potential hardware issues.
  • Join our Accelerated and Resilient Compute Systems team and help build the resilient, highly available, cost-effective computing platform driving our success.

Salary

Base: 184,000 USD - 356,500 USD; Bonus/Equity: Not specified; Benefits: Not specified

Skills & Requirements

Must-have

  • GPU hardware architecture
  • RAS features
  • Architecture models development
  • Python scripting and automation
  • C/C++ proficiency
  • Debugging and analytical skills

Nice-to-have

  • Resiliency and datacenter RAS experience
  • Verilog/SystemVerilog RTL simulation
  • CUDA programming
  • Machine Learning/Deep Learning concepts

Key Requirements

  • Master’s or PhD degree or equivalent experience
  • 5+ years of relevant experience
  • Familiarity with GPU and Networking Architectures
  • Knowledge of computer architecture fundamentals

Work Rights

Not specified
