Drive reliability, scalability, and performance across cloud-based infrastructure in a distributed, dynamic environment
Job Summary
Design and operate reliable cloud platforms and services by applying SRE principles, automation-first practices, and strong operational discipline.
Collaborate with Engineering, Architecture, DevOps, and Security to improve uptime, accelerate detection and recovery, and continuously harden systems.
Skills & Requirements
Must-have
Cloud platforms (AWS, Azure, GCP)
Infrastructure as Code (IaC)
Observability stacks and monitoring
Incident response and troubleshooting
Automation and scripting (Python, PowerShell, Bash)
Containers and orchestration (Docker, Kubernetes)
Nice-to-have
Chaos engineering or resilience testing
Multicloud and hybrid cloud deployments
SLOs, SLIs, and error budgets
Gathering operational feedback
Key Requirements
Hands-on programming/scripting experience
Strong experience with one or more cloud platforms
Experience with containers and orchestration technologies
Proficiency in Infrastructure as Code (IaC)
Experience with observability stacks
Practical experience using cloud data migration tools
Advanced knowledge of Windows and Linux/Unix environments