Sr. Technical Program Manager - AI/ML Hardware Health & Stability, Global Data Center Operations

Amazon

Seattle, WA, US
On-site

Job Summary

  • This role serves as the central owner of operational health metrics, including failure rates, repair efficacy, and dwell time, for new AI/ML hardware platforms.
  • The position requires leading cross-functional investigations and post-mortem processes to translate lessons learned into preventive design improvements.
  • The successful candidate will act as a strategic advisor bridging hardware engineering, data center operations, and service teams to optimize capacity delivery.

Skills & Requirements

Must-have

  • AI/ML hardware platform experience
  • Data center operations management
  • Cross-functional program leadership
  • Hardware failure root cause analysis
  • Operational health KPI ownership

Nice-to-have

  • Strategic account management skills
  • Executive stakeholder communication
  • Supply chain coordination experience
  • Proactive risk mitigation strategies
  • Global infrastructure deployment knowledge

Key Requirements

  • Senior Technical Program Manager experience
  • Background in hardware engineering or operations
  • Experience with GenAI or high-performance computing infrastructure

Work Rights

Not specified
