Job Summary
The role involves building high-performance, fault-tolerant infrastructure to support bare-metal provisioning and platform validation for mission-critical GPU clusters.
Candidates will optimize Kubernetes and Slurm platforms for multi-node AI training performance using technologies such as NCCL, UCX, and GPUDirect.
The position requires establishing comprehensive observability across GPU, InfiniBand fabric, storage, and provisioning components while documenting architecture designs.
Skills & Requirements
Must-have
Bare-metal cluster provisioning with Ironic
Kubernetes internals: CRDs, operators, controllers
Slurm configuration for AI/HPC workloads
GPU systems: NVIDIA H100/H200, NVLink topology
RDMA networking optimization: InfiniBand, RoCE
Linux kernel, cgroups, and system services tuning
Performance benchmarking: MLPerf, NCCL tests
Nice-to-have
Custom Kubernetes operator development
Ansible, Terraform automation expertise
Go, Python, Bash, Rust programming skills
BIOS, BMC, IPMI, Redfish firmware knowledge
Continuous improvement mindset in infrastructure
Collaboration with L2 SRE and operations teams
Key Requirements
Bachelor's or Master's degree in Computer Science or Engineering
Experience with Metal3, OpenStack Ironic, MaaS, xCAT tools