Network Engineer - AI/HPC

xAI

Memphis, TN, US
On-site
10 years network design experience
5 years in the Ethernet AI/HPC space
Deep understanding of RoCEv2 congestion control

Job Summary

  • xAI is seeking an engineer with deep experience in RoCEv2 to optimize performance and availability for their massive GPU clusters.
  • The role involves spending significant time debugging NCCL, building metric dashboards, and tweaking configurations to ensure maximum training and inference efficiency.
  • Candidates must be prepared to travel frequently to Memphis for capacity building and to participate in a team on-call rotation.

Salary

Not specified

Skills & Requirements

Must-have

  • 10 years network design experience
  • 5 years in the Ethernet AI/HPC space
  • Deep understanding of RoCEv2 congestion control
  • Debugging NCCL library for AI workloads
  • Python automation for network metrics

Nice-to-have

  • InfiniBand protocol knowledge
  • Contributing to open-source libraries
  • Flat organizational structure fit
  • Hands-on engineering excellence mindset
  • Strong communication skills

Key Requirements

  • Minimum 10 years designing large-scale networks
  • 5 years of experience in the Ethernet AI/HPC space
  • Expertise in creating performance metrics portfolios
  • Ability to debug and potentially commit to NCCL
  • Proficiency in Python for task automation

Work Rights

Not specified
