AI Evaluation & Safety Engineer

Berkeley Research Group

3+ years ML evaluation experience
Python fundamentals with LLM APIs
Building testing and benchmarking pipelines
The role focuses on owning the evaluation discipline that makes AI tools trustworthy enough for experts to rely on in litigation, investigations, and analytics.

Job Summary

  • The role focuses on owning the evaluation discipline that makes AI tools trustworthy enough for experts to rely on in litigation, investigations, and analytics.
  • You will design frameworks, benchmarks, and guardrails to detect hallucinations and prevent unsafe outputs before they reach consultants.
  • The position requires partnering with infrastructure engineers to operationalize evaluation within CI/CD so no model ships without passing quality gates.

Skills & Requirements

Must-have

  • 3+ years ML evaluation experience
  • Python fundamentals with LLM APIs
  • Building testing and benchmarking pipelines
  • LLM failure modes and hallucination detection
  • Automated metrics like ROUGE and BERTScore

Nice-to-have

  • Production RAG or agentic pipeline evaluation
  • Adversarial red-teaming and prompt injection
  • Observability tooling like Grafana or Datadog
  • AI compliance in regulated industries
  • Model interpretability techniques

Key Requirements

  • 3+ years in ML evaluation or AI safety
  • Strong Python skills
  • Experience with LLM APIs (OpenAI, Anthropic)
  • Familiarity with evaluation frameworks (Ragas, DeepEval, Promptfoo)

Work Rights

Not specified
