Machine Learning (Ops) Engineer

NEWBRIDGE ALLIANCE PTE. LTD.

Singapore

Job Summary

  • The role involves building and operating low-latency, high-throughput online inference services for deep learning and LLM models.
  • Candidates will own systems for distributed training on GPU clusters using Kubernetes, Ray, DeepSpeed, or Megatron.
  • The team operates at the scale of major social-commerce platforms and supports more than 100 ML scientists and engineers.

Matching Summary

Match Score: 85

Skills & Requirements

Must-have

  • Python, Go, or Java programming proficiency
  • Kubernetes and Docker production experience
  • Model serving and distributed training expertise
  • Experience with Spark, Kafka, or other large-scale data tools
  • Knowledge of a cloud platform (AWS, GCP, or Azure)

Nice-to-have

  • Deep expertise in GPU inference optimization
  • Experience with LLM fine-tuning pipelines
  • Background in high-QPS online services
  • Contributions to open-source ML infra projects
  • Knowledge of PyTorch, TensorFlow, or JAX internals

Key Requirements

  • BS/MS in Computer Science or related field
  • 3+ years building ML infrastructure or platform services
  • Strong software design and testing skills

Work Rights

Not specified
