Member of Technical Staff, Infrastructure and Training Systems
Radical Numerics
San Francisco, US
On-site
Distributed training systems
Performance optimization
Reusable training frameworks
Job Summary
Design and scale distributed training systems for large-scale biological world models across large compute systems, focusing on performance, stability, and scalability.
Develop performance optimizations across the stack, including communication patterns, memory efficiency, custom kernels, compilation paths, and systems instrumentation, to ensure training compute is used effectively.
Establish standards and mechanisms for robustness, maintainability, debugging, and safe deployment of fast-moving research infrastructure, including fault tolerance, checkpointing, monitoring, and experiment hygiene.
Skills & Requirements
Must-have
distributed training systems
performance optimization
reusable training frameworks
ML infrastructure
Python, PyTorch, Triton, CUDA, C++
Nice-to-have
large-scale distributed training
open-source ML systems contributions
ML runtimes and compilers
quantitative sciences background
Key Requirements
Strong engineering track record
Proficiency in Python, PyTorch, Triton, CUDA, and C++
Strong understanding of modern deep learning frameworks