Job Summary
The role involves leading the research and implementation of a document conversion pipeline to transform unstructured PDFs into formatted .docx files.
The candidate will compare commercial and open-source AI solutions and establish metrics for format fidelity.
This is a hybrid role requiring both strategic decision-making and hands-on development within a diverse global team across the USA, Spain, Portugal, and India.
Matching Summary
Match Score: 85
Skills & Requirements
Must-have
Expert-level Python skills
Experience with OpenCV and PyMuPDF
Deep familiarity with Tesseract and PaddleOCR
Knowledge of LayoutLMv3, Donut, or Nougat models
Understanding of OOXML document formats
Experience integrating GPT or Claude models
Nice-to-have
Experience with Pandoc AST
Background in desktop publishing (DTP), typography, or graphic design
Contributions to open-source OCR projects
Key Requirements
Expert-level Python proficiency required
Deep familiarity with modern Transformer-based document models
Architectural vision for API vs custom pipeline decisions