Senior Cloud Infrastructure Engineer

Gatik AI

Mountain View, CA, United States
On-site

Job Summary

  • This role serves as the backbone of the AI platform by building high-performance systems that enable researchers to develop perception and planning models.
  • The engineer will architect mission-critical Kubernetes clusters optimized for heavy GPU/TPU workloads while implementing self-healing infrastructure using autonomous AI agents.
  • Candidates must be willing to work on-site five days a week at the Mountain View, CA office to support the company's autonomous middle-mile logistics operations.

Skills & Requirements

Must-have

  • Kubernetes cluster management for GPU workloads
  • Apache Airflow and Kafka pipeline development
  • Terraform Infrastructure as Code implementation
  • ArgoCD GitOps workflow automation
  • NCCL networking optimization for distributed training

Nice-to-have

  • Experience with LangGraph and CrewAI agents
  • Familiarity with Triton Inference Server and Ray Serve
  • Knowledge of 3D Gaussian Splatting techniques
  • Background in PyTorch Distributed and DeepSpeed

Key Requirements

  • 5+ years of experience in Cloud Infrastructure or MLOps
  • Deep expertise in Kubernetes, Helm, and container orchestration
  • Strong background in Apache Airflow, Argo Workflows, MLflow, and Terraform

Work Rights

Not specified
