Site Reliability Engineer, Infrastructure - Analytics Platform
OpenAI
San Francisco, California, United States
Job Summary
The team designs and operates critical infrastructure to accelerate research progress towards AGI at OpenAI.
This role requires owning production-critical infrastructure end-to-end with a focus on large-scale ClickHouse clusters and high-throughput Kafka pipelines.
Candidates must be able to independently define operational standards while remaining deeply hands-on in production systems.
Skills & Requirements
Must-have
ClickHouse cluster operations
Kafka pipeline management
Snowflake workflow experience
Infrastructure as Code (IaC)
Kubernetes and Terraform
Incident response and SLOs
Nice-to-have
Ability to define operational standards independently
Rigor in high-pressure production environments
Cross-team collaboration skills
Deep hands-on debugging mindset
Key Requirements
Track record of owning production infrastructure
Strong hands-on experience with ClickHouse and Kafka