Senior Site Reliability Engineer - Token Factory (Inference Platform)
Nebius
Remote
Job Summary
The role involves owning the reliability, performance, and observability of an entire inference stack supporting tens of thousands of GPUs.
Candidates will design telemetry pipelines, tune Kubernetes autoscalers, and harden request-routing logic to ensure seamless operations under extreme load.
Nebius offers a competitive compensation package, career growth opportunities, and a collaborative culture focused on shaping the future of AI.
Salary
Competitive compensation (amount not specified). Benefits include flexibility and learning opportunities.
Skills & Requirements
Must-have
Deep fluency with Kubernetes
Experience with Prometheus and Grafana
Proficiency in Terraform and infrastructure-as-code
Scripting skills in Python or Bash
Knowledge of GPU-heavy workloads
Nice-to-have
Background in MLOps or model-hosting platforms
Experience with vLLM, Triton, or Ray
Understanding of alert design and SLOs
Ability to debug from kernel to application layer
Collaborative mindset with software engineers
Key Requirements
Deep fluency with Kubernetes, Prometheus, Grafana, and Terraform
Comfortable scripting in Python or Bash
Experience shepherding GPU-heavy workloads
Proven track record in production environments
Strong understanding of distributed backend failures
Work Rights
Must be authorized to work in the country of application