Member Of Technical Staff — Training

RadixArk

Palo Alto, CA, United States
Competitive compensation; meaningful equity; compr...
On-site
5+ years ML systems experience
Large-scale distributed training expertise
GPU/TPU architecture knowledge
RadixArk is seeking a Member of Technical Staff focused on building and scaling systems for training advanced AI models. The role involves working with large-scale distributed training infrastructure and requires significant expertise in ML systems and performance engineering.

Job Summary

  • RadixArk is seeking a Member of Technical Staff to build and scale systems that train frontier AI models across thousands of GPUs.
  • The role involves designing large-scale distributed training systems while optimizing throughput, scalability, and hardware efficiency.
  • Candidates will collaborate with model researchers to support frontier experiments and drive capacity planning strategies for cluster utilization.

Matching Summary

Match Score: 85

Salary

Competitive compensation; meaningful equity; comprehensive benefits

Skills & Requirements

Must-have

  • 5+ years ML systems experience
  • Large-scale distributed training expertise
  • GPU/TPU architecture knowledge
  • PyTorch or JAX distributed stacks
  • Python plus C++, Go, or Rust
  • Production ML systems operations

Nice-to-have

  • Multi-billion parameter model training
  • DeepSpeed, Megatron-LM, or FSDP experience
  • RDMA/InfiniBand high-speed interconnects
  • HPC or performance-critical computing background
  • Open-source ML systems contributions
  • Checkpointing, fault recovery, and elastic training
  • Training cost efficiency optimization

Key Requirements

  • 5+ years in ML systems or distributed infrastructure
  • Proficiency in Python and a systems language
  • Experience debugging large training job stability issues

Work Rights

Not specified