The company is building next-generation AI infrastructure from the ground up to deliver highly performant and scalable network clusters for large-scale AI training and inference.
Job Summary
This role leads hands-on bring-up of network clusters across data center environments, owning execution from node installation through production readiness.
Success involves validating BIOS configurations, tuning fabrics, debugging performance issues, and transforming ad hoc deployments into repeatable, reliable systems.
Skills & Requirements
Must-have
5-8+ years of infrastructure engineering experience
Hands-on HGX/DGX server deployment
High-speed networking (InfiniBand, RoCE, Ethernet)
Strong Linux systems knowledge
Distributed systems performance troubleshooting
Onsite data center work capability
Nice-to-have
AI/ML infrastructure or HPC environment experience