Lead Software Engineer, Reliability

Klaviyo

Dublin, Ireland
On-site
Technical leadership for reliability
Design and operate production systems
Cloud-native, platform-focused SRE

Job Summary

  • Set the technical vision and long-term strategy for reliability, availability, and operational excellence across critical platforms.
  • Drive adoption of SRE best practices across engineering teams, including SLIs, SLOs, error budgets, and reliability-based decision making.
  • Mentor senior and mid-level engineers, raising the bar for technical quality, operational maturity, and reliability culture across the organization.

Skills & Requirements

Must-have

  • technical leadership for reliability
  • design and operate production systems
  • cloud-native, platform-focused SRE
  • production-quality code (Python, Go)
  • distributed, cloud-native systems
  • containerized workloads and platforms (Kubernetes)
  • observability platforms and alerting strategies
  • SRE concepts (SLIs, SLOs, error budgets)
  • infrastructure as code (Terraform)
  • capacity planning and performance analysis
  • incident response for high-severity events
  • mentor senior and mid-level engineers

Nice-to-have

  • leading critical platforms or internal tooling
  • identity, access management, secrets management
  • operating systems at scale in cloud environments
  • resilience testing, fault injection, chaos engineering
  • algorithms and data structures for large-scale systems
  • experimented with AI in work or personal projects

Key Requirements

  • Senior technical leader
  • Deep systems expertise
  • Strong judgment and influence
  • Cloud-native, platform-focused SRE
  • Production-quality code (e.g. Python, Go, or similar)
  • Led design and operation of distributed, cloud-native systems
  • Extensive experience operating containerized workloads and platforms (e.g. Kubernetes)
  • Owning on-call strategy and participating in escalation
  • Designed and evolved observability platforms and alerting strategies
  • Apply SRE concepts such as SLIs, SLOs, error budgets
  • Strong hands-on experience with infrastructure as code and declarative configuration
  • Led capacity planning, load testing, and performance analysis
  • Drive high-quality post-incident reviews
  • Comfortable leading technical discussions

Work Rights

Not specified

Sponsorship: available