Senior HPC Site Reliability Engineer

Topjobstoday

Not specified (likely hybrid or onsite based on typical HPC roles)
8+ years of experience in large-scale compute infrastructure
Experience with job schedulers like SLURM and SGE
Strong script-writing skills in Python and Bash
Topjobstoday is seeking a Senior HPC Site Reliability Engineer to enhance their HPC infrastructure, focusing on designing and optimizing private compute clouds for advanced applications in chip modeling and deep learning. The ideal candidate will have extensive experience in large-scale compute environments, job scheduling, and cloud services, along with strong scripting skills.

Job Summary

  • Join NVIDIA's mission to improve HPC infrastructure and innovate technology.
  • Provide leadership in designing and implementing a large-scale compute cloud.
  • Help tackle strategic challenges in resource utilization and cloud strategy.

Matching Summary

Match Score: 85

Skills & Requirements

Must-have

  • 8+ years of experience in large-scale compute infrastructure
  • Experience with job schedulers like SLURM and SGE
  • Strong script-writing skills in Python and Bash

Nice-to-have

  • Linux certification from a well-known vendor
  • Prior experience managing large-scale Kubernetes deployments
  • Strong skills in modern container networking

Key Requirements

  • B.Sc in Computer Science or related field
  • Solid understanding of cluster configuration management tools
  • Knowledge of deploying PaaS microservices

Work Rights

Not specified
