Job Summary
This role involves pioneering the integration of Vision-Language Models directly into the FLUX generation stack to improve controllability and alignment.
Candidates must have a proven track record of pretraining or significantly advancing VLMs rather than just fine-tuning existing models.
The team values deep scientific understanding, low-ego collaboration, and bold execution, balancing research excellence with shipping products.
Skills & Requirements
Must-have
Experience pretraining or significantly advancing a VLM
Production track record in multimodal architectures
Deep understanding of vision and language representations
Experience with distributed multi-node training
Nice-to-have
Experience with diffusion or flow-based generative models
Knowledge of how autoregressive and diffusion paradigms can be composed
Strong publication record in frontier research
Key Requirements
Staff or Senior Individual Contributor level experience
Proven deployment of pretrained VLMs in production systems
Demonstrated ability to push the frontier on multimodal architectures