Research Scientist – Speech and Audio Understanding (Large Models & Multimodal Systems)

Tencent Music Entertainment Group

Bellevue, Washington, US

Job Summary

  • The role involves building native multimodal model systems that jointly support vision, audio, and text for comprehensive world perception.
  • Candidates will contribute to developing general-purpose end-to-end large speech models covering multilingual ASR, translation, and synthesis.
  • Employees are eligible for a sign-on payment, relocation package, restricted stock units, and 15 to 25 days of vacation per year.

Matching Summary

Match Score: 75

Tencent Music Entertainment Group is seeking a Research Scientist specializing in speech and audio understanding within large models and multimodal systems. The role involves developing advanced multimodal models that integrate audio, text, and vision, along with managing high-quality datasets in this domain.

Salary

Base: $122,500.00 to $229,700.00 per year; Bonus/Equity: Sign-on payment and restricted stock units available; Benefits: Medical, dental, vision, life, disability, 401(k), and paid leave

Skills & Requirements

Must-have

  • PhD in Computer Science or related field
  • Speech and audio signal processing expertise
  • Deep learning frameworks like PyTorch or TensorFlow
  • Transformer-based architecture knowledge
  • Multilingual automatic speech recognition experience

Nice-to-have

  • Experience with distributed training systems
  • Background in data synthesis technologies
  • Familiarity with Wav2Vec or HuBERT models
  • Cross-modal modeling experience
  • State-of-the-art performance on audio tasks

Key Requirements

  • PhD in Computer Science, EE, AI, or Linguistics
  • Master's degree with several years of relevant experience
  • Proficiency in ASR, TTS, or speech translation pipelines

Work Rights

Not specified
