Senior Networking Solution Test Engineer – AI Cluster Debugging
NVIDIA
Multiple Locations
Linux networking and debugging skills
Host-side NIC validation and tuning
AI networking libraries and protocols expertise
Job Summary
You will work on pioneering NVLink, Ethernet and InfiniBand-based AI clusters and own complex issues across hardware, system software and AI workloads.
Collaborate closely with development teams to debug networking components, and define tests that guide automation toward robust, debuggable suites that produce actionable logs and metrics.
At NVIDIA, we value diversity and are committed to creating an inclusive environment for all employees and provide reasonable accommodations throughout the employment process.
Skills & Requirements
Must-have
Linux networking and debugging skills
Host-side NIC validation and tuning
AI networking libraries and protocols expertise
System-level debugging on Linux
Scripting and automation with Bash/Python/Ansible
End-to-end cluster troubleshooting
Nice-to-have
Hands-on debugging of collective communication libraries
Experience with large-scale AI clusters
Tuning and debugging congestion control for AI workloads
Familiarity with NVIDIA networking technologies
Multi-layer networking issue debugging
Collaborative and fast learner mindset
Key Requirements
B.A./B.Sc. in Computer Science or Electrical Engineering
8+ years of networking or system-level testing experience
Strong Linux networking debugging expertise
Proven production-grade debugging experience
Ability to read and reason about source code
Experience with AI networking protocols and libraries