Sr. System Development Engineer, Cloud AI/ML/storage server teams

Amazon

Cupertino, CA, US
On-site

Job Summary

  • You will lead the development of automation software and diagnostic tooling to maintain the health of AWS storage and AI/ML compute fleets.
  • The role requires decomposing complex server testability and reliability problems into straightforward tasks while driving delivery through hardware, software, and system design knowledge.
  • You will collaborate with internal engineering teams and external ODMs to ensure new server designs meet rigorous testability and automation requirements throughout the lifecycle.

Salary

Not specified

Skills & Requirements

Must-have

  • Linux kernel driver debugging
  • Python, Ruby, Java, and C/C++ programming
  • Hardware failure root cause analysis
  • Fleet health predictive infrastructure
  • PCIe, NVMe, and GPU subsystem troubleshooting
  • x86 and ARM architecture expertise

Nice-to-have

  • Experience with ODM design partners
  • Zero-touch operations vision
  • Cross-team technical leadership
  • CI/CD pipeline management
  • Server conception and qualification
  • Scalable system design skills

Key Requirements

  • Systems Development Engineer experience
  • Proficiency in Linux internals and drivers
  • Background in hardware diagnostics
  • Experience with server fleet operations

Work Rights

Not specified
