Inference Optimization Architect, Speech AI

NVIDIA

Not specified
Deep learning model inference optimization
CUDA kernel development experience
Model serving deployment with Triton
NVIDIA is seeking an Inference Optimization Architect for its Speech AI team, focusing on enhancing the performance of Speech AI models by optimizing inference latency and resource utilization. The ideal candidate will have extensive experience in deep learning model optimization, particularly in areas such as CUDA development and model serving.

Job Summary

  • This role focuses on accelerating and scaling Speech AI models to improve customer experiences through reduced latency and improved throughput.
  • The successful candidate will implement advanced optimization techniques including quantization, pruning, and custom CUDA kernel development.
  • NVIDIA seeks a creative professional passionate about solving real-world conversational AI problems within a dynamic team environment.

Matching Summary

Match Score: 85

Skills & Requirements

Must-have

  • Deep learning model inference optimization
  • CUDA kernel development experience
  • Model serving deployment with Triton
  • Quantization, pruning, and distillation techniques
  • GPU profiling using Nsight Systems

Nice-to-have

  • Publications or open-source contributions
  • Experience with embedded systems or edge deployment
  • Strong collaboration skills in a matrixed environment

Key Requirements

  • Master's or BE/BTech in Computer Science
  • 10+ years of total experience required
  • 5+ years in deep learning inference optimization
  • Solid understanding of Transformers, CNNs, and RNNs

Work Rights

Not specified