xAI is seeking an engineer with deep experience in RoCEv2 to optimize performance and availability for a 100k GPU cluster
Job Summary
xAI is seeking an engineer with deep experience in RoCEv2 to optimize performance and availability for a 100k GPU cluster.
The role involves working extensively inside NCCL, building metrics dashboards, and tuning configurations to maximize training and inference efficiency.
Candidates must be prepared to travel to Memphis frequently for capacity building and to participate in a team on-call rotation.