Research Infrastructure Engineer, Training Systems

OpenAI

San Francisco, California, United States
Remote
Strong systems engineering instincts
Experience with distributed systems and gpus
Proficiency in python and pytorch
The team builds infrastructure that turns novel research ideas into runnable, measurable training workloads for large models

Job Summary

  • The team builds infrastructure that turns novel research ideas into runnable, measurable training workloads for large models.
  • Engineers will design APIs and interfaces that make complex training workflows easier to express and harder to misuse.
  • This role involves debugging issues spanning Python, PyTorch, distributed systems, GPUs, networking, and storage to ensure reliability.

Matching Summary

The team builds infrastructure that turns novel research ideas into runnable, measurable training workloads for large models.

Skills & Requirements

Must-have

  • Strong systems engineering instincts
  • Experience with distributed systems and GPUs
  • Proficiency in Python and PyTorch
  • Ability to design clean API abstractions
  • Debugging skills using profiles and traces

Nice-to-have

  • Empathy for researchers and engineers
  • Interest in frontier model research
  • Experience with large-scale experimentation
  • Focus on performance optimization
  • Comfort with production-quality infrastructure

Work Rights

Not specified

Tailored Resume

Cover Letter