Site Reliability Engineer (SRE)

Thinking Machines Lab

San Francisco, California, United States
On-site

Job Summary

  • Thinking Machines Lab aims to empower humanity by advancing collaborative general intelligence and providing access to frontier AI tools.
  • The Site Reliability Engineer will define end-to-end reliability for the Tinker platform, balancing job completion with development velocity.
  • The base salary range is $350,000 – $475,000 USD; benefits include unlimited PTO, and visa sponsorship is available.

Salary

  • Base: $350,000 – $475,000 USD
  • Bonus/Equity: Not specified
  • Benefits: Health, dental, vision, unlimited PTO, paid parental leave, relocation support

Skills & Requirements

Must-have

  • Distributed systems experience
  • Cloud infrastructure proficiency
  • Production incident response
  • Software tooling for reliability
  • Systematic reliability improvement

Nice-to-have

  • Large-scale cloud service operations
  • Distributed training framework knowledge
  • Checkpoint and recovery system building
  • Kubernetes cluster tuning at scale
  • Heterogeneous GPU workload management

Key Requirements

  • Bachelor's degree in computer science or equivalent
  • Experience with production incident response and postmortems
  • Strong communication skills for cross-team coordination

Work Rights

Not specified

Sponsorship: Available
