Senior DGX Cloud AI Infrastructure Software Engineer

NVIDIA

Multiple Locations

Job Summary

  • Joining NVIDIA's DGX Cloud AI Efficiency Team means contributing to the infrastructure that powers innovative AI research and delivering a stable, scalable environment for AI researchers.
  • The role offers autonomy to work on meaningful projects with support and mentorship, fostering a culture of learning, growth, and risk-taking.
  • NVIDIA is committed to diversity and equal opportunity, providing competitive salary ranges, equity, and benefits.

Salary

Base: 184,000 USD - 287,500 USD (Level 4), 224,000 USD - 356,500 USD (Level 5); Bonus/Equity: Eligible for equity; Benefits: Eligible for benefits

Skills & Requirements

Must-have

  • Large-scale AI infrastructure development
  • Distributed systems engineering
  • AI training and inferencing services
  • Observability platforms (ELK, Prometheus, Loki)
  • Programming in Python and C/C++
  • Software engineering best practices

Nice-to-have

  • Experience with RDMA software stack
  • Defining reliability metrics
  • Root cause analysis at hardware level
  • Working with large-scale clusters
  • Telemetry and observability software stack
  • Knowledge of DL frameworks internals
  • Culture of blameless postmortems and iterative improvement

Key Requirements

  • 8+ years of software infrastructure experience
  • Bachelor's degree in Computer Science or related field
  • Strong debugging and triage skills from application to hardware level
  • Experience with large-scale distributed systems
  • Proficiency in Python and C/C++ programming
  • Experience with observability and monitoring tools
  • Excellent communication and collaboration skills

Work Rights

Not specified
