Networking Solution Test Engineer - AI Cluster Debugging

NVIDIA

Multiple Locations

Job Summary

  • You will work on cutting-edge Ethernet-based AI clusters, owning complex issues across hardware, system software and AI workloads.
  • Collaborate closely with development teams to debug NCCL, RoCE/RDMA and related networking components using logs, code inspection and targeted experiments.
  • NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer.

Skills & Requirements

Must-have

  • Linux networking and debugging skills
  • Host-side NIC validation and tuning
  • AI networking libraries and protocols
  • System-level testing and debugging
  • Scripting and automation with Bash/Python/Ansible

Nice-to-have

  • Debugging collective communication libraries
  • Experience with large-scale GPU clusters
  • Tuning congestion control and lossless Ethernet
  • Familiarity with NVIDIA networking technologies
  • Multi-layer networking and AI framework debugging

Key Requirements

  • B.A./B.Sc. in Computer Science or Electrical Engineering
  • 2+ years networking or system-level testing experience
  • Proven production-grade debugging experience
  • Ability to read and reason about source code
  • Strong ownership and collaboration skills

Work Rights

Not specified
