Systems Development Engineer, Research Compute Platform, Fauna

Amazon

New York, NY, US
142,300.00 - 192,400.00 usd annually py
On-site
On-prem gpu compute management
Job scheduling layer implementation
Slurm ray skypilot experience
This role involves owning the end-to-end physical and virtual infrastructure used by ML scientists to train reinforcement learning policies for real robots

Job Summary

  • This role involves owning the end-to-end physical and virtual infrastructure used by ML scientists to train reinforcement learning policies for real robots.
  • The ideal candidate must be equally comfortable diagnosing hardware issues like GPU thermal faults as they are designing complex job scheduling systems.
  • Fauna Robotics is an Amazon company dedicated to building capable, safe, and delightful robots that people want to interact with in everyday spaces.

Matching Summary

This role involves owning the end-to-end physical and virtual infrastructure used by ML scientists to train reinforcement learning policies for real robots.

Salary

142,300.00 - 192,400.00 USD annually

Skills & Requirements

Must-have

  • On-prem GPU compute management
  • Job scheduling layer implementation
  • Slurm Ray SkyPilot experience
  • CUDA driver and hardware troubleshooting
  • Cloud burst capacity design

Nice-to-have

  • Comfort working alongside researchers
  • Diagnosing GPU thermal faults
  • Designing job schedulers
  • Profiling ML workloads

Key Requirements

  • Strong systems engineering fundamentals
  • Experience with GPU provisioning and imaging
  • Ability to triage training issues with scientists

Work Rights

Not specified

Tailored Resume

Cover Letter