Job Summary
This role focuses on optimizing the performance of machine learning models for both training and real-time inference within a rapid-feedback trading environment.
Candidates must have deep, low-level GPU knowledge, including PTX, SASS, warps, and the memory hierarchy, to keep large-scale operations efficient.
The team values an inventive approach and encourages asking hard questions about whether the right tools and methodologies are being used.
Skills & Requirements
Must-have
Low-level systems programming experience
Knowledge of CUDA, PTX, SASS, warps, and cooperative groups
Understanding of distributed GPU training and collective algorithms
InfiniBand, RoCE, GPUDirect, and NVLink networking technologies