Remote
Job Summary
Lightning AI is seeking a Platform Support Engineer to act as a technical partner for ML teams running large-scale training and inference workloads.
The role involves diagnosing complex failures in distributed systems, Kubernetes scheduling, and GPU orchestration while translating infrastructure issues into actionable guidance.
Employees benefit from a comprehensive package including medical coverage, paid time off, professional development support, and a flexible remote work environment.
Salary
Not specified. Benefits: comprehensive medical, dental, and vision coverage, PTO, parental leave, and stipends.
Skills & Requirements
Must-have
Kubernetes and containerized environments
Linux systems knowledge and networking
Distributed ML systems and PyTorch
GPU orchestration and CUDA debugging
Observability tools like Prometheus
Nice-to-have
Large scale model training experience
High-performance networking with InfiniBand
Bare metal infrastructure operations
Python automation and tooling scripts
Ray or Kubeflow familiarity
Key Requirements
Strong software engineering and systems troubleshooting background
Experience operating machine learning workloads in production