Job Summary
Together AI is building the AI Acceleration Cloud to combine the fastest LLM inference engine with state-of-the-art AI cloud infrastructure.
The role involves designing and developing foundational backend services that power a highly available global cloud platform serving internal and external customers.
Candidates will work on a distributed GPU scheduling system and manage a global plane for data center compute, networking, and storage.
Salary
Base: $160,000 - $230,000; Equity: Startup equity included; Benefits: Health insurance and remote flexibility
Skills & Requirements
Must-have
5+ years building large-scale distributed systems
Expert-level Golang programming skills
Experience with Kubernetes and container orchestration
Strong knowledge of compute, networking, and storage
Proficiency in the PostgreSQL relational database
Nice-to-have
Experience with AWS, Azure, or GCP cloud providers
Familiarity with Kinesis, Airflow, or Kafka data infrastructure
Background in ML hardware virtualization
Experience with Slurm cluster management
Knowledge of GB200s/GB300s and BlueField DPUs
Key Requirements
Bachelor's or Master's degree in Computer Science or related field
5+ years of experience in fault-tolerant distributed systems