Senior HPC Infrastructure Engineer

Firmus Metal International Pte. Ltd

Sydney, NSW, Australia
On-site

Job Summary

  • The role involves building high-performance, fault-tolerant infrastructure to support bare-metal provisioning and platform validation for mission-critical GPU clusters.
  • Candidates will optimize Kubernetes and Slurm platforms for multi-node AI training performance using advanced technologies like NCCL, UCX, and GPUDirect.
  • The position requires establishing comprehensive observability across GPU, InfiniBand fabric, storage, and provisioning components while documenting architecture designs.

Skills & Requirements

Must-have

  • Bare-metal cluster provisioning with Ironic
  • Kubernetes internals: CRDs, operators, controllers
  • Slurm configuration for AI/HPC workloads
  • GPU systems: NVIDIA H100/H200, NVLink topology
  • RDMA networking optimization: InfiniBand, RoCE
  • Linux tuning: kernel, cgroups, system services
  • Performance benchmarking: MLPerf, NCCL tests

Nice-to-have

  • Custom Kubernetes operator development
  • Ansible and Terraform automation expertise
  • Programming skills in Go, Python, Bash, and Rust
  • Firmware knowledge: BIOS, BMC, IPMI, Redfish
  • Continuous improvement mindset in infrastructure
  • Collaboration with L2 SRE and operations teams

Key Requirements

  • Bachelor's or Master's degree in Computer Science or Engineering
  • Experience with provisioning tools: Metal3, OpenStack Ironic, MAAS, xCAT
  • Deep knowledge of Kubernetes lifecycle management
  • Strong understanding of Slurm and GPU topology
  • Practical Linux systems engineering experience
  • Proficiency in Go, Bash, Rust, or Python
  • Experience participating in on-call rotation

Work Rights

Not specified
