Job Summary
As a foundational member of the reliability function within the Data Engineering organization, you will ensure the availability, performance, and resilience of high-throughput telemetry and analytics platforms.
You will play a key role in designing systems where resilience, automation, and observability are built in from the start.
We are looking for engineers who are uncomfortable with manual toil and are driven to build platforms where scaling, recovery, and operational insight are inherent properties of the system architecture.
Skills & Requirements
Must-have
Site Reliability Engineering experience
DevOps or Platform Engineering
Linux systems administration
Cloud-native infrastructure
High-throughput data platforms
Infrastructure as Code (Terraform)
Observability stacks (Prometheus, Grafana)
Nice-to-have
Data engineering platform support
Kubernetes and container orchestration
Stream processing frameworks
Real-time telemetry environments
Multi-cloud or hybrid cloud platforms
Key Requirements
Proven experience in SRE, DevOps, or Platform Engineering
Strong experience with Linux systems administration
Experience operating high-throughput data platforms or streaming systems
Hands-on experience with Infrastructure as Code tools