Lead Software Engineer, DevOps Domain (Bangkok Based, Relocation Provided)

Agoda

Bangkok, Thailand
On-site

Job Summary

  • Lead the technical vision, architecture, and execution of new SRE platforms or reliability initiatives.
  • Design, build, and operate reliability platforms including load shedding, business signals monitoring, and safe-deployment automation.
  • Advance platform observability and reliability signals using Prometheus and Grafana, balancing actionability, scale, and cost efficiency.

Skills & Requirements

Must-have

  • SRE platforms and reliability initiatives
  • SLI/SLO-driven engineering
  • Kubernetes ecosystem and service mesh
  • Prometheus and Grafana expertise
  • Incident management lifecycle
  • Canary deployments and automated rollback

Nice-to-have

  • Chaos engineering and resilience testing
  • Scaling org-wide SLO/SRE frameworks
  • ML-assisted detection for signal tuning
  • Operating large-scale high-QPS systems

Key Requirements

  • 8+ years of relevant experience
  • Ownership of production system architecture
  • Experience leading complex cross-team initiatives
  • Expertise in Go, Python, Rust, or Java
  • Hands-on Kubernetes and service mesh experience
  • Observability and monitoring expertise
  • Strong command of the incident management lifecycle
  • Experience with reliability engineering patterns
  • Solid data analysis skills, including SQL
  • Excellent communication and collaboration skills

Work Rights

Not specified

Sponsorship: available
