Senior HPC Cluster Engineer

NVIDIA

Location: Not specified (assumed to be hybrid or fully remote based on industry standards).

Job Summary

  • The role involves designing, deploying, and operating GPU Compute Clusters specifically for Electronic Design Automation and high-performance computing workloads.
  • Candidates will develop scalable automation solutions to improve infrastructure provisioning, management, observability, and day-to-day operations.
  • NVIDIA offers a competitive base salary ranging from $152,000 to $287,500 depending on the level, along with equity and benefits.

Matching Summary

Match Score: 85

NVIDIA is seeking a Senior HPC Cluster Engineer to design, deploy, and manage GPU Compute Clusters for Electronic Design Automation (EDA) and high-performance computing workloads. The role requires extensive experience in HPC systems, automation, and collaboration with diverse teams, contributing to innovative solutions and performance enhancements.

Salary

Base: $152,000 - $241,500 USD (Level 3) or $184,000 - $287,500 USD (Level 4); Bonus/Equity: Eligible for equity; Benefits: Comprehensive benefits package included

Skills & Requirements

Must-have

  • 5 years of large-scale compute infrastructure experience
  • Proficiency in Slurm, LSF, PBS, or K8s job schedulers
  • Applied experience with MPI and NCCL workflows
  • Linux administration on Rocky, CentOS, RHEL, or Ubuntu
  • Proficiency with the container technologies Enroot and Docker
  • Python and Bash scripting expertise

Nice-to-have

  • Background with NVIDIA GPUs, CUDA programming, and MLPerf
  • Experience supporting EDA workloads and tools
  • Familiarity with InfiniBand, RDMA, and RoCE networking
  • Understanding of distributed storage such as Lustre and GPFS
  • Metrics collection with Prometheus, OpenSearch, and Grafana

Key Requirements

  • Bachelor's degree in Computer Science, Electrical Engineering, or equivalent
  • Minimum 5 years of experience with cluster configuration management tools such as BCM or Ansible
  • Strong problem-solving skills for complex system analysis

Work Rights

Not specified
