Network Engineer - AI/HPC

xAI

Memphis, TN, US
On-site
10 years designing large-scale networks
5 years in the Ethernet AI/HPC space
Deep understanding of RoCEv2 congestion control

Job Summary

  • xAI is seeking an engineer with deep experience in RoCEv2 to optimize performance and availability for a 100k GPU cluster.
  • The role involves spending significant time deep inside NCCL, building metric dashboards, and tweaking configurations to ensure maximum training and inference efficiency.
  • Candidates must be prepared for significant travel to Memphis for capacity building and participation in a team on-call rotation.

Skills & Requirements

Must-have

  • 10 years designing large-scale networks
  • 5 years in the Ethernet AI/HPC space
  • Deep understanding of RoCEv2 congestion control
  • Debugging the NCCL library for AI workloads
  • Python automation for network operations

Nice-to-have

  • Experience with InfiniBand protocols
  • Ability to commit code to the NCCL library
  • Experience working in a flat organizational structure
  • Hands-on engineering culture fit
  • Strong prioritization and communication skills

Key Requirements

  • Minimum 10 years designing large-scale networks
  • 5 years of experience in the Ethernet AI/HPC space
  • Expertise in Python for automation

Work Rights

Not specified
