Distributed GPU training collective algorithms understanding
Job Summary
This role focuses on optimizing the performance of machine learning models for both training and real-time inference within a rapid-feedback trading environment.
The position requires a whole-systems approach that includes storage systems, networking, and host- and GPU-level considerations to ensure efficient large-scale operations.
Candidates must possess deep low-level GPU knowledge, including PTX, SASS, warps, and the memory hierarchy, to debug and optimize CUDA performance effectively.
Skills & Requirements
Must-have
Low-level systems programming experience
Knowledge of CUDA, PTX, SASS, warps, and cooperative groups
Understanding of collective algorithms for distributed GPU training
Networking expertise in InfiniBand, RoCE, GPUDirect, and NVLink