xAI is seeking an engineer with deep experience in RoCEv2 to optimize performance and availability for their massive GPU clusters
Job Summary
xAI is seeking an engineer with deep experience in RoCEv2 to optimize performance and availability for their massive GPU clusters.
The role involves spending significant time debugging NCCL, building metric dashboards, and tweaking configurations to maximize training and inference efficiency.
Candidates must be prepared to travel frequently to Memphis for capacity building and to participate in a team on-call rotation.
Skills & Requirements
Must-have
10 years designing large-scale networks
5 years in the Ethernet AI/HPC space
Deep understanding of congestion control
Debugging NCCL library and configurations
Python automation for network tasks
Nice-to-have
Experience with InfiniBand protocols
Commitment to open source libraries
Experience working in a flat organizational structure
Hands-on engineering mindset
Strong communication skills
Key Requirements
Minimum 10 years network design experience
5 years specifically in Ethernet AI/HPC
Expertise in creating performance metrics portfolios