Senior Site Reliability Engineer — Token Factory (Inference Platform)

Nebius

Remote

Job Summary

  • The role involves owning the reliability, performance, and observability of an entire inference stack supporting tens of thousands of GPUs.
  • Candidates will design telemetry pipelines, tune Kubernetes autoscalers, and harden request-routing logic to ensure seamless operations under extreme load.
  • Nebius offers a competitive compensation package, career growth opportunities, and a collaborative culture focused on shaping the future of AI.

Salary

Competitive compensation (amount not specified). Benefits include flexibility and learning opportunities.

Skills & Requirements

Must-have

  • Deep fluency with Kubernetes
  • Experience with Prometheus and Grafana
  • Proficiency in Terraform and infrastructure-as-code
  • Scripting skills in Python or Bash
  • Knowledge of GPU-heavy workloads

Nice-to-have

  • Background in MLOps or model-hosting platforms
  • Experience with vLLM, Triton, or Ray
  • Understanding of alert design and SLOs
  • Ability to debug from kernel to application layer
  • Collaborative mindset with software engineers

Key Requirements

  • Deep fluency with Kubernetes, Prometheus, Grafana, and Terraform
  • Comfortable scripting in Python or Bash
  • Experience shepherding GPU-heavy workloads
  • Proven track record in production environments
  • Strong understanding of distributed backend failures

Work Rights

Must be authorized to work in the country of application
