xAI is seeking an engineer with deep experience in RoCEv2 to optimize performance and availability for their massive GPU clusters
Job Summary
xAI is seeking an engineer with deep experience in RoCEv2 to optimize performance and availability for their massive GPU clusters.
The role involves spending significant time debugging NCCL, building metrics dashboards, and tuning configurations to maximize training and inference efficiency.
Candidates must be prepared to travel frequently to Memphis for capacity building and to participate in a team on-call rotation.