Job Summary
The role involves designing and implementing infrastructure that supports large-scale experiments, data processing, and model training across thousands of GPUs.
Candidates will partner closely with research scientists and ML engineers to turn experimental workloads into robust, repeatable pipelines.
Databricks offers a comprehensive benefits package and is committed to fair compensation practices, with eligibility for an annual performance bonus and equity.
Salary
Base: $199,000 - $270,000 USD
Bonus/Equity: Eligible for annual performance bonus and equity
Benefits: Comprehensive benefits package offered
Skills & Requirements
Must-have
5+ years of distributed systems experience
Proficiency in C++, Rust, Go, Java, or Scala
Experience with Kubernetes, Slurm, Ray, or custom schedulers
Deep understanding of ML training and inference workflows
Experience building large-scale backend services and data pipelines
Nice-to-have
Mentoring engineers on compute and AI systems
Translating research needs into infrastructure solutions
Driving complex systems from prototype to production
Pragmatic approach to operational excellence
Experience with HPC clusters or cloud-based GPU fleets
Key Requirements
BS, MS, or PhD in Computer Science or a related field
5+ years of software engineering experience
Experience with cluster schedulers or resource managers