Network Engineer - AI/HPC

xAI

Memphis, TN, US
On-site
10 years designing large-scale networks
5 years in the Ethernet AI/HPC space
Deep understanding of RoCEv2 congestion control

Job Summary

  • xAI is seeking an engineer with deep experience in RoCEv2 to optimize performance and availability for their massive GPU clusters.
  • The role involves spending significant time debugging NCCL, building metric dashboards, and tweaking configurations to maximize training and inference efficiency.
  • Candidates must be prepared to travel frequently to Memphis for capacity building and to participate in a team on-call rotation.

Skills & Requirements

Must-have

  • 10 years designing large-scale networks
  • 5 years in the Ethernet AI/HPC space
  • Deep understanding of RoCEv2 congestion control
  • Debugging NCCL for AI training workloads
  • Python automation for network operations

Nice-to-have

  • InfiniBand experience
  • Experience committing code to the NCCL library
  • Hands-on engineering excellence mindset
  • Strong prioritization skills
  • Adaptability to a flat organizational structure

Key Requirements

  • Minimum 10 years of network design experience
  • 5 years of Ethernet AI/HPC-specific experience
  • Expertise in Python scripting
  • Experience with InfiniBand (bonus)

Work Rights

Not specified
