xAI is seeking an engineer with deep experience in RoCEv2 to optimize performance and availability for their massive GPU clusters
Job Summary
xAI is seeking an engineer with deep experience in RoCEv2 to optimize performance and availability for their massive GPU clusters.
The role involves spending significant time debugging NCCL, building metric dashboards, and tweaking configurations to ensure maximum training and inference efficiency.
Candidates must be prepared to travel frequently to Memphis for capacity building and to participate in a team on-call rotation.
Salary
Not specified
Skills & Requirements
Must-have
10 years of network design experience
5 years of experience in the Ethernet AI/HPC space
Deep understanding of RoCEv2 congestion control
Debugging NCCL library for AI workloads
Python automation for network metrics
Nice-to-have
InfiniBand protocol knowledge
Committing to open source libraries
Flat organizational structure fit
Hands-on engineering excellence mindset
Strong communication skills
Key Requirements
Minimum of 10 years designing large-scale networks
5 years of experience in the Ethernet AI/HPC space
Expertise in creating performance metrics portfolios