Site Reliability Engineer, Infrastructure - Analytics Platform

OpenAI

San Francisco, California, United States

Job Summary

  • The team designs and operates critical infrastructure to accelerate research progress towards AGI at OpenAI.
  • This role requires owning production-critical infrastructure end-to-end with a focus on large-scale ClickHouse clusters and high-throughput Kafka pipelines.
  • Candidates must be able to independently define operational standards while remaining deeply hands-on in production systems.

Skills & Requirements

Must-have

  • ClickHouse cluster operations
  • Kafka pipeline management
  • Snowflake workflow experience
  • Infrastructure as Code (IaC)
  • Kubernetes and Terraform
  • Incident response and SLOs

Nice-to-have

  • Ability to define operational standards independently
  • Rigor in high-pressure production environments
  • Cross-team collaboration skills
  • Deep hands-on debugging mindset

Key Requirements

  • Track record of owning production infrastructure
  • Strong hands-on experience with ClickHouse and Kafka
  • Practical experience with Snowflake workflows

Work Rights

Not specified
