Job Summary
You will help build and evolve systems that support performance analysis, telemetry, and optimization for large-scale GPU- and CPU-based clusters used in AI and high-performance computing environments.
This is a fast-paced R&D environment where system behavior and requirements evolve rapidly, requiring adaptable engineering solutions and strong analytical thinking.
You will work closely with hardware, networking, firmware, and software teams to collect, analyze, and interpret performance data from live systems.
Skills & Requirements
Must-have
Performance analysis of AI and HPC workloads
High-performance networking expertise
Cross-functional R&D collaboration
Telemetry collection and data refinement
Performance benchmarking and diagnostic tools
Nice-to-have
Knowledge of CUDA and NCCL internals
Experience with deep learning frameworks
Proficiency in Python and Linux environments
Experience with cloud platforms
Strong analytical and communication skills
Key Requirements
B.Sc. or M.Sc. in Computer Science or related field
5+ years of experience in performance analysis or HPC/AI infrastructure
Hands-on experience with RDMA, MPI, and NCCL
Strong understanding of system performance metrics
Ability to work in fast-paced cross-functional teams