Job Summary
The role involves partnering with enterprise governors to define and implement key reliability, security, and resilience requirements across the platform.
Candidates will be responsible for turning resilience capabilities into reusable patterns, tools, and services that help customers build fault-tolerant systems.
The position requires driving incident management efforts by applying industry best practices and maturing the problem management lifecycle through standardized runbooks.
Skills & Requirements
Must-have
Python programming for automation tools
Terraform or CloudFormation Infrastructure as Code
AWS or Google Cloud Platform operations
Resilience and disaster recovery capabilities
CI/CD pipeline development with GitHub and Jenkins
Observability solutions using Splunk
ITIL-based incident and problem management
Nice-to-have
Performance engineering for ML/AI solutions
Snowflake or RDBMS data reliability knowledge
Fault-tolerant system design expertise
FMEA resilience frameworks familiarity
Multi-region deployment strategies
Serverless architecture experience
Key Requirements
4+ years Python programming experience
4+ years Infrastructure as Code experience
2-3 years public cloud operations experience
4+ years resilience and reliability design experience
2+ years CI/CD pipeline building experience
4+ years applying reliability engineering concepts