Sr Site Reliability Engineer

GHX

Hyderabad, India
On-site

Job Summary

  • The Site Reliability Engineer (SRE) will be a hands-on contributor within the Site Reliability Engineering Center of Excellence (CoE), responsible for building monitoring and observability solutions, troubleshooting production issues, and participating in 24x7 on-call operations.
  • This role focuses on executing reliability practices, implementing observability tooling, reducing mean time to recovery (MTTR) and mean time to detect (MTTD) through automation, and ensuring production systems are resilient, observable, and performant.
  • The SRE will collaborate with Engineering, Product, Security, Cloud, and DevSecOps teams to embed reliability practices and provide input on instrumentation, monitoring hooks, and operational readiness for services.

Skills & Requirements

Must-have

  • monitoring and observability solutions
  • troubleshooting production issues
  • 24x7 on-call operations
  • implementing observability tooling
  • automation to reduce MTTR/MTTD
  • AWS services (EC2, ECS, EKS)
  • containers & Kubernetes (Docker, EKS)

Nice-to-have

  • SaaS or healthcare environments
  • database observability and performance
  • chaos engineering and resiliency testing

Key Requirements

  • 7+ years in SRE, Operations, or Infrastructure Engineering
  • Strong hands-on experience with monitoring and observability platforms
  • Experience with New Relic, Datadog, Prometheus, Grafana, CloudWatch, OpenTelemetry, Graylog
  • Proven experience in incident response and troubleshooting production issues
  • Good knowledge of SLIs, SLOs, SLAs, and error budgets
  • Scripting or programming in Python, Go, or shell
  • Understanding of networking, distributed systems, and high-availability architectures
  • Exposure to ITIL/ITSM processes

Work Rights

Not specified
