Reliability Lead, Common Services

CoreWeave

New York, NY, USA
On-site

Job Summary

  • Establish and lead the SRE / production engineering practice for the Common Services organization, including standards for reliability, incident management, and on-call.
  • Own and improve the incident management lifecycle for Common Services, including on-call rotations, escalation paths, incident tooling, post-incident reviews, and follow-through on corrective actions.
  • Hire, mentor, and develop SRE and production engineering talent, fostering a culture of continuous improvement, learning from incidents, and humane on-call.

Skills & Requirements

Must-have

  • reliability engineering practice
  • production operations
  • incident management lifecycle
  • observability strategy
  • design for reliability
  • automate operational workflows

Nice-to-have

  • humane on-call culture
  • data-driven decision making
  • GPU workloads experience

Key Requirements

  • 7+ years of SRE/production engineering experience
  • 2+ years of technical leadership
  • Linux-based production environments
  • observability stacks experience
  • running on-call rotations
  • infrastructure-as-code and automation tooling

Work Rights

Not specified
