Architect Multimodal RAG Systems for Video and PDFs
Job Summary
The role focuses on developing AI systems that seamlessly integrate vision and audio models to enhance voice-to-voice interactions.
Candidates will be responsible for architecting multimodal RAG systems capable of retrieving insights from videos and PDFs.
This position requires expertise in optimizing streaming latency and in integrating models such as Whisper and CLIP into core agent reasoning loops.
Skills & Requirements
Must-have
Integrate vision encoders and audio-native models
Optimize streaming latency for voice interactions
Architect multimodal RAG systems for video and PDFs
Experience with Whisper, CLIP, and multimodal LLMs