Senior Systems Software Engineer, AI Infrastructure

NVIDIA

Work arrangement: Not specified (assumed to be hybrid based on industry standards for similar roles).

Job Summary

  • Develop and maintain large-scale AI Infrastructure systems that support critical use cases, including frontier model training, driving reliability, operability, and scalability across global public and private clouds.
  • Implement SRE fundamentals, including incident management, monitoring, and performance optimization, while designing automation tools to reduce manual processes and operational overhead.
  • NVIDIA offers highly competitive salaries and a comprehensive benefits package, and is considered one of the technology world’s most desirable employers.

Matching Summary

Match Score: 85

NVIDIA is seeking a Senior Systems Software Engineer for AI Infrastructure with expertise in software development, systems engineering, and Site Reliability Engineering (SRE). The role involves enhancing large-scale systems for AI model training, focusing on operational reliability, observability, and automation.

Salary

  • Base (Level 3): 152,000 USD - 241,500 USD
  • Base (Level 4): 184,000 USD - 287,500 USD
  • Bonus/Equity: Eligible for equity
  • Benefits: Comprehensive benefits package

Skills & Requirements

Must-have

  • Python and C/C++/Go/Perl/Ruby proficiency
  • Linux/Windows systems engineering
  • Cloud platform expertise (AWS, Azure, GCP, OCI)
  • SRE principles and error budgets
  • Observability platforms (ELK, Prometheus, Loki)
  • CI/CD systems (GitLab)

Nice-to-have

  • AI training and inferencing infrastructure
  • Deep learning frameworks (PyTorch, TensorFlow, JAX, Ray)
  • Cloud/hardware health monitoring
  • Distributed systems with stringent SLAs
  • Incident, change, and problem management

Key Requirements

  • 5+ years Software Development, SRE, or Production Engineering
  • Degree in Computer Science or related field, or equivalent experience
  • Infrastructure as Code tools (Terraform CDK)
  • Strong communication skills

Work Rights

Not specified
