Software Development Engineer, EC2 UltraServer Availability

Amazon

Seattle, WA, US
On-site
AWS native services
Repair and recovery workflows
NVIDIA GB200 and GB300 UltraServers

Job Summary

  • The role focuses on ensuring high availability of customer GB200 and GB300 UltraServers by orchestrating complex repair and recovery workflows from impairment detection through completion.
  • Engineers will build stable, logical, and testable cloud-based solutions using AWS native services while managing hardware integrations specific to GPU clusters and AI/ML training systems.
  • This is a hands-on position where the engineer owns the entire lifecycle from requirements gathering and design reviews to implementation, operations, and driving continuous improvement.

Salary

Not specified

Skills & Requirements

Must-have

  • AWS native services
  • repair and recovery workflows
  • NVIDIA GB200 and GB300 UltraServers
  • system architecture design
  • hardware-software integration
  • GPU cluster network partitioning

Nice-to-have

  • cross-functional collaboration skills
  • mentoring junior engineers
  • continuous improvement mindset
  • operational excellence focus
  • stakeholder management

Key Requirements

  • SDE II-level experience
  • Expertise in AWS services
  • System architecture background

Work Rights

Not specified
