Principal Datacenter Resiliency Architect, RAS Features and Modeling

NVIDIA

US, CA, United States
Fully remote

Job Summary

  • You will be a key member of a team of innovators, challenging the status quo and pushing beyond boundaries.
  • This role offers the opportunity to shape the industry's leading datacenter GPUs and SoCs, which power NVIDIA's AI and HPC product lines.
  • NVIDIA is committed to fostering a diverse work environment and is proud to be an equal opportunity employer.

Salary

Base: 272,000 USD - 431,250 USD; Bonus/Equity: Eligible for equity; Benefits: Eligible for benefits

Skills & Requirements

Must-have

  • GPU hardware architecture expertise
  • RAS features development
  • Reliability (FIT) modeling
  • Hardware/software error handling
  • Python scripting and automation
  • Datacenter system resiliency

Nice-to-have

  • Network and high-speed interface resiliency
  • Leading cross-functional RAS projects
  • Resiliency trade-offs in AI datacenters
  • Collaboration with remote teams
  • Strong debugging and analytical skills

Key Requirements

  • PhD in Computer or Electrical Engineering or equivalent experience
  • 10+ years of relevant experience
  • Strong understanding of GPU and networking architectures
  • Experience with RAS features and reliability modeling
  • Excellent interpersonal and collaboration skills

Work Rights

Not specified
