Job Summary
This role owns the evaluation discipline that makes AI tools trustworthy enough for experts to rely on in litigation, investigations, and analytics.
You will design frameworks, benchmarks, and guardrails that detect hallucinations and block unsafe outputs before they reach consultants.
You will also partner with infrastructure engineers to build evaluation into CI/CD so that no model ships without passing quality gates.
Skills & Requirements
Must-have
3+ years of ML evaluation experience
Python fundamentals with LLM APIs
Building testing and benchmarking pipelines
LLM failure modes and hallucination detection
Automated metrics such as ROUGE and BERTScore
Nice-to-have
Production RAG or agentic pipeline evaluation
Adversarial red-teaming and prompt injection
Observability tooling such as Grafana or Datadog
AI compliance in regulated industries
Model interpretability techniques
Key Requirements
3+ years in ML evaluation or AI safety
Strong Python skills
Experience with LLM APIs (OpenAI, Anthropic)
Familiarity with evaluation frameworks (Ragas, DeepEval, Promptfoo)