Member of Technical Staff, Infrastructure and Training Systems

Radical Numerics

San Francisco, US
On-site
Distributed training systems
Performance optimization
Reusable training frameworks

Job Summary

  • Design and scale distributed training systems for large-scale biological world models on large compute clusters, focusing on performance, stability, and scalability.
  • Develop performance optimizations across the stack, including communication patterns, memory efficiency, custom kernels, compilation paths, and systems instrumentation, to ensure training compute is used effectively.
  • Establish standards and mechanisms for robustness, maintainability, debugging, and safe deployment of fast-moving research infrastructure, including fault tolerance, checkpointing, monitoring, and experiment hygiene.

Skills & Requirements

Must-have

  • distributed training systems
  • performance optimization
  • reusable training frameworks
  • ML infrastructure
  • Python, PyTorch, Triton, CUDA, C++

Nice-to-have

  • large-scale distributed training
  • open-source ML systems contributions
  • ML runtimes and compilers
  • quantitative sciences background

Key Requirements

  • Strong engineering track record
  • Proficiency in Python, PyTorch, Triton, CUDA, and C++
  • Strong understanding of modern deep learning frameworks
  • Ability to debug complex systems
  • Comfort working in a collaborative environment
  • Excellent communication skills

Work Rights

Not specified
