Network Engineer - AI/HPC

x.ai

Memphis, TN, US
On-site

Job Summary

  • xAI is seeking an engineer with deep experience in RoCEv2 to optimize performance and availability for a massive GPU cluster.
  • The role involves spending significant time deep inside NCCL to build metric dashboards and tweak configurations for training and inference workloads.
  • Candidates must be prepared for significant travel to Memphis for capacity building and must participate in a team on-call rotation.

Skills & Requirements

Must-have

  • 10 years of large-scale network design
  • 5 years of Ethernet AI/HPC experience
  • Deep understanding of RoCEv2 congestion control
  • NCCL debugging and optimization expertise
  • Python automation for network operations

Nice-to-have

  • InfiniBand protocol knowledge
  • Contributions to open-source libraries
  • Hands-on engineering culture fit
  • Strong communication skills
  • Driven by curiosity and initiative

Key Requirements

  • Minimum 10 years designing large-scale networks
  • 5 years of experience in the Ethernet AI/HPC space
  • Expertise in creating performance metrics portfolios

Work Rights

Not specified