Site Reliability Engineer, Machine Learning Systems - Singapore

BYTEDANCE PTE. LTD.

Singapore
Hybrid

ByteDance is seeking a Site Reliability Engineer for Machine Learning Systems in Singapore, focused on maintaining and optimizing large-scale ML systems. The ideal candidate will have a strong background in programming and distributed systems operations, and a passion for innovation within a diverse team.

Job Summary

  • The ByteDance Large Model Team is committed to developing the most advanced AI large model technology in the industry.
  • You will build large-scale heterogeneous systems integrating GPU/NPU/RDMA/Storage and ensure they run steadily and reliably.
  • The role offers a positive team atmosphere, career growth opportunities, and paid leave within a flat organization.

Skills & Requirements

Must-have

  • Bachelor's degree in Computer Science
  • Proficiency in Go, Python, or Shell
  • Kubernetes and container experience
  • Linux environment operations
  • Distributed system maintenance

Nice-to-have

  • Large-scale ML distributed system experience
  • GPU server operation and maintenance
  • Strong logical analysis ability
  • Excellent documentation habits
  • Global team collaboration skills

Key Requirements

  • Bachelor's degree or above
  • Computer Science or related major
  • 1+ year Kubernetes O&M experience

Work Rights

Not specified
