Networking Solution Test Engineer - AI Cluster Debugging
NVIDIA
Multiple Locations
Job Summary
You will work on cutting-edge Ethernet-based AI clusters, owning complex issues across hardware, system software and AI workloads.
Collaborate closely with development teams to debug NCCL, RoCE/RDMA and related networking components using logs, code inspection and targeted experiments.
NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer.
Skills & Requirements
Must-have
Linux networking and debugging skills
Host-side NIC validation and tuning
AI networking libraries and protocols
System-level testing and debugging
Scripting and automation with Bash/Python/Ansible
Nice-to-have
Debugging collective communication libraries
Experience with large-scale GPU clusters
Tuning congestion control and lossless Ethernet
Familiarity with NVIDIA networking technologies
Multi-layer networking and AI framework debugging
Key Requirements
B.A./B.Sc. in Computer Science or Electrical Engineering
2+ years of networking or system-level testing experience