Senior Server RAS Engineer

NVIDIA Corporation

Job Summary

  • This role focuses on improving the reliability of NVIDIA GPU and Grace systems by designing robust RAS features for enterprise AI infrastructure.
  • The engineer will collaborate with cross-functional teams, including hardware engineers and software developers, to meet strict reliability standards.
  • Candidates must possess strong Python programming skills in a Linux operating environment along with extensive knowledge of system-level architecture.

Salary

Not specified

Skills & Requirements

Must-have

  • 10+ years experience in reliability engineering
  • Strong Python programming in Linux environment
  • Deep understanding of Linux kernel internals
  • System-level architecture invention skills
  • Fault tolerance mechanism design expertise

Nice-to-have

  • Hands-on experience with scale-out architectures
  • Proficiency in system-level simulation tools
  • Familiarity with x86 or ARM system architecture
  • Understanding of Machine Check Architecture (MCA) interactions
  • Track record of platform-level RAS implementation

Key Requirements

  • BS, MS, or PhD in EE/CS or related field
  • 10+ years of demonstrated experience in reliability engineering
  • Strong code review skills required

Work Rights

Not specified
