Design and scale a distributed ML platform for data, feature store, training, and inferencing, building multi-region, containerized infrastructure using Kubernetes and Terraform
Job Summary
Design and scale a distributed ML platform for data, feature store, training, and inferencing, building multi-region, containerized infrastructure using Kubernetes and Terraform.
Optimize GPU/TPU/CPU utilization for large-scale training and real-time inferencing, implementing distributed training, model parallelism, and caching with Ray.
Drive FinOps-aligned architecture, auto-scaling, and efficiency, enabling observability, SLA/SLO tracking, and incident management.
Skills & Requirements
Must-have
Distributed ML platform design
Kubernetes and Terraform infrastructure
GPU/TPU/CPU optimization
Ray for distributed training
FinOps-aligned architecture
Python programming
Nice-to-have
Privacy-first principles
AI Commerce innovation
Inspiration-led discovery
Key Requirements
8-12 years of experience in ML platforms or distributed systems
Strong skills in Python, Kubernetes, and GPU/TPU optimization
Proven ability to design fault-tolerant, high-throughput ML systems