SDE III

InMobi

Bangalore, India
On-site
Distributed ML platform design
Kubernetes and Terraform infrastructure
GPU/TPU/CPU optimization

Job Summary

  • Design and scale a distributed ML platform for data, feature store, training, and inferencing, building multi-region, containerized infrastructure using Kubernetes and Terraform.
  • Optimize GPU/TPU/CPU utilization for large-scale training and real-time inferencing, implementing distributed training, model parallelism, and caching with Ray.
  • Drive FinOps-aligned architecture, auto-scaling, and efficiency, enabling observability, SLA/SLO tracking, and incident management.

Skills & Requirements

Must-have

  • Distributed ML platform design
  • Kubernetes and Terraform infrastructure
  • GPU/TPU/CPU optimization
  • Ray for distributed training
  • FinOps-aligned architecture
  • Python programming

Nice-to-have

  • Privacy-first principles
  • AI Commerce innovation
  • Inspiration-led discovery

Key Requirements

  • 8-12 years of experience in ML platforms or distributed systems
  • Strong skills in Python, Kubernetes, and GPU/TPU optimization
  • Proven ability to design fault-tolerant, high-throughput ML systems

Work Rights

Not specified
