Sr. Technical Program Manager - AI/ML Hardware Health & Stability, Global Data Center Operations
Amazon
Seattle, WA, US
On-site
Job Summary
This role serves as the central owner of operational health metrics, including failure rates, repair efficacy, and dwell time, for new AI/ML hardware platforms.
The position requires leading cross-functional investigations and post-mortem processes to translate lessons learned into preventive design improvements.
The candidate will act as a strategic advisor, bridging hardware engineering, data center operations, and service teams to optimize capacity delivery.
Skills & Requirements
Must-have
AI/ML hardware platform experience
Data center operations management
Cross-functional program leadership
Hardware failure root cause analysis
Operational health KPI ownership
Nice-to-have
Strategic account management skills
Executive stakeholder communication
Supply chain coordination experience
Proactive risk mitigation strategies
Global infrastructure deployment knowledge
Key Requirements
Senior Technical Program Manager experience
Background in hardware engineering or operations
Experience with GenAI or high-performance computing infrastructure