Job Summary
Apply software engineering techniques, automation, and incident response best practices to ensure the reliability, availability, and scalability of systems, platforms, and technology.
Develop tools and scripts to automate operational processes, reducing manual workload, increasing efficiency, and improving system resilience.
Collaborate with development teams to integrate best practices for reliability, scalability, and performance into the software development lifecycle.
Skills & Requirements
Must-have
Python or Java + Bash
Terraform and/or CloudFormation
Jenkins and/or GitLab CI/CD
Elastic/Grafana/Prometheus
Linux troubleshooting
Incident response automation
Production ops ownership
Nice-to-have
Kubernetes/container troubleshooting
AWS enterprise exposure
Regulated environment experience
Zero-downtime deployment patterns
Key Requirements
3–7 years of experience in SRE, DevOps, or Platform Engineering
Solid fundamentals in networking, security, and distributed systems