Senior Solutions Architect, Cloud Infrastructure and DevOps - Nvis

NVIDIA Corporation

Job Summary

  • This role involves maintaining large-scale HPC/AI clusters with comprehensive monitoring, logging, and alerting capabilities.
  • The successful candidate will serve as the primary point of contact for customers, analyzing and defining large-scale networking projects in collaboration with partners and internal teams.
  • NVIDIA is seeking an autonomous and creative professional to join a dynamic team building some of the world's largest and fastest AI systems.

Skills & Requirements

Must-have

  • 8+ years of experience with networking fundamentals (TCP/IP)
  • Kubernetes container orchestration for AI/ML workloads
  • HPC cluster deployment and troubleshooting
  • Job scheduling and orchestration with Slurm, Kubernetes, and Singularity
  • Linux internals (Red Hat, CentOS, Ubuntu)
  • Python programming and Bash scripting
  • Automation with Jenkins, Ansible, Puppet, and Chef

Nice-to-have

  • Knowledge of CPU and GPU architecture
  • Experience with GPU-focused hardware and software (DGX, CUDA)
  • Familiarity with RDMA fabrics (InfiniBand, RoCE)
  • Awareness of emerging storage technologies
  • Collaboration with Japanese-speaking customers

Key Requirements

  • BS/MS/PhD in Computer Science or related field
  • Minimum of 8 years of professional experience in networking
  • Extensive hands-on experience with Kubernetes
  • Proficiency in Python and Bash scripting
  • Experience with Lustre, GPFS, ZFS, and XFS storage

Work Rights

Not specified
