LLM Quality Testing Framework
Built an evaluation framework that reduced hallucinations by ~80% in a legal research product.
Key Results
- Hallucination rate cut from 18% to 3.5%
- Citation accuracy up from 71% to 94%; groundedness score up from 0.68 to 0.91
- Production eval suite shipped in 3 weeks and runs automatically on every PR
The Problem
A legal tech startup had built an AI-powered case research tool. Users could ask questions like "What are the key precedents for wrongful termination in Ontario?" and get synthesized answers with citations. The product worked—mostly. But 1 in 5 responses contained factual errors: made-up case names, incorrect dates, or citations that didn't support the claims.
For a legal product, this was existential. One wrong citation could destroy user trust and expose clients to malpractice risk.
Pain points
- No systematic testing — QA was manual spot-checking by lawyers
- Silent failures — Wrong answers looked just as confident as right ones
- Model drift — Switching from GPT-4 to GPT-4-turbo introduced new failure modes
- No baseline — "Is this better?" was answered by gut feel, not data
The Intervention
We built a comprehensive evaluation framework covering accuracy, groundedness, and citation quality:
Eval dimensions
- Factual accuracy — Are claims true? Method: LLM-as-judge against source docs (see the sketch after this list)
- Groundedness — Is every claim supported by retrieved context? Method: NLI model + citation verification
- Citation precision — Do citations actually say what's claimed? Method: Extractive matching + semantic similarity
- Completeness — Are key precedents included? Method: Golden answer comparison
- Hallucination rate — % of responses with fabricated content. Method: Multi-model consensus check
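To make the LLM-as-judge method concrete, here is a minimal sketch of a factual-accuracy judge. It assumes the OpenAI Python client; the prompt wording, the 1-to-5 scale, and the JSON output shape are illustrative, not the production implementation.

```python
# Minimal LLM-as-judge sketch for the factual-accuracy dimension.
# Prompt wording, scoring scale, and output schema are assumptions.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading a legal research answer against its source documents.
Score factual accuracy from 1 (mostly fabricated) to 5 (fully supported) and list
any claims the sources do not support.
Return JSON only: {{"score": <int>, "unsupported_claims": ["..."]}}

Source documents:
{sources}

Answer to grade:
{answer}
"""

def judge_factual_accuracy(answer: str, sources: list[str], model: str = "gpt-4") -> dict:
    """Ask an LLM judge to grade one answer against its source documents."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic grading
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(sources="\n---\n".join(sources), answer=answer),
        }],
    )
    # The judge is asked for JSON; a production eval would validate this more defensively.
    return json.loads(response.choices[0].message.content)
```

Cross-validation with a second judge (Claude, per the stack below) follows the same pattern with a different client.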
Test suite architecture
Pipeline flow:
- Test Case (question + golden answer + source docs)
- RAG Pipeline processes the question
- Response + Citations generated
- Eval Suite runs all checks:
  - FactualAccuracyEval
  - GroundednessEval
  - CitationPrecisionEval
  - CompletenessEval
- Scores + Failure Analysis output
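A runner for this pipeline can be small. The sketch below assumes a TestCase shape matching the test corpus described next; run_suite, EvalResult, and the rag_pipeline callable are illustrative names, not the framework's real API.

```python
# Sketch of the eval-suite runner implied by the pipeline above.
# TestCase/EvalResult fields and the callable signatures are assumed names.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class TestCase:
    question: str
    golden_answer: str
    source_docs: list[str]
    adversarial: bool = False  # flags questions designed to trigger hallucinations

@dataclass
class EvalResult:
    name: str
    score: float                                   # normalized 0.0 to 1.0
    failures: list[str] = field(default_factory=list)

def run_suite(
    case: TestCase,
    rag_pipeline: Callable[[str], tuple[str, list[str]]],  # question -> (response, citations)
    evals: list,                                            # FactualAccuracyEval, GroundednessEval, ...
) -> list[EvalResult]:
    """Run the RAG pipeline on one test case, then score the response on every eval."""
    response, citations = rag_pipeline(case.question)
    return [e.evaluate(case, response, citations) for e in evals]
```

Each eval class exposes the same evaluate(case, response, citations) interface, which is what makes the failure analysis at the end of the pipeline uniform.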
Test corpus
- 200 curated questions across 8 legal domains
- Golden answers written by practicing lawyers
- Source documents — 500+ cases from CanLII
- Adversarial examples — Questions designed to trigger hallucinations
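The case study does not specify how the corpus is stored; one plausible layout is a JSONL file with one test case per line. The field names below are assumptions.

```python
# Hypothetical JSONL corpus loader; field names and file name are assumptions.
import json
from pathlib import Path

def load_corpus(path: str) -> list[dict]:
    """Load test cases from a JSONL file, one case per line.

    Expected fields (assumed): question, golden_answer, source_doc_ids,
    domain (one of the 8 legal domains), adversarial (bool).
    """
    cases = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if line.strip():
            cases.append(json.loads(line))
    return cases

# Example usage: count cases per legal domain for per-domain reporting.
# from collections import Counter
# domain_counts = Counter(c["domain"] for c in load_corpus("eval_corpus.jsonl"))
```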
Stack
- Eval framework: Custom Python, inspired by RAGAS and DeepEval
- LLM judges: GPT-4 + Claude for cross-validation
- NLI model: DeBERTa-v3 for groundedness checks (see the sketch after this list)
- CI integration: GitHub Actions runs evals on every PR
- Dashboard: Streamlit for exploring failures
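As an illustration of the groundedness check, the sketch below uses a Hugging Face DeBERTa-v3 NLI cross-encoder to test whether a cited passage entails a claim. The specific checkpoint and the 0.8 threshold are assumptions; the case study only names "DeBERTa-v3".

```python
# Groundedness sketch: score whether the cited context entails each claim.
# The checkpoint and threshold are assumptions, not the client's exact setup.
from transformers import pipeline

nli = pipeline(
    "text-classification",
    model="MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli",  # assumed DeBERTa-v3 NLI checkpoint
)

def claim_is_grounded(claim: str, context: str, threshold: float = 0.8) -> bool:
    """Treat a claim as grounded if the NLI model says the context entails it."""
    # For pair classification the pipeline accepts {"text": premise, "text_pair": hypothesis}.
    scores = nli({"text": context, "text_pair": claim}, top_k=None)
    by_label = {s["label"].lower(): s["score"] for s in scores}
    return by_label.get("entailment", 0.0) >= threshold
```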
The Outcome
Before/after metrics
Before → After:
- Hallucination rate: 18% → 3.5%
- Citation accuracy: 71% → 94%
- Groundedness score: 0.68 → 0.91
- Factual accuracy: 79% → 93%
What changed
- Retrieval improvements — Eval revealed that 60% of hallucinations stemmed from poor retrieval, not generation
- Prompt engineering — Added explicit "only cite if directly supports claim" instruction
- Post-processing — A citation verification step removes unsupported claims before the response is returned (sketched below)
- Model selection — Data showed Claude 3.5 had 40% fewer hallucinations than GPT-4-turbo for this domain
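The post-processing step above can be approximated with a simple filter: split the response into sentences, look up each sentence's citations, and drop sentences the cited passages do not entail. The regex-based splitting, the "[n]" citation format, and the grounded callable are simplifications, not the production code.

```python
# Post-processing sketch: drop sentences whose citations do not support them.
# Sentence splitting, the "[n]" citation format, and the lookup shape are assumptions.
import re
from typing import Callable

def strip_unsupported_claims(
    response: str,
    citations: dict[str, str],             # e.g. "[3]" -> quoted source passage
    grounded: Callable[[str, str], bool],  # e.g. claim_is_grounded from the NLI sketch
) -> str:
    """Keep only sentences whose cited passages support them."""
    kept = []
    for sentence in re.split(r"(?<=[.!?])\s+", response.strip()):
        keys = re.findall(r"\[\d+\]", sentence)
        if not keys:
            kept.append(sentence)          # uncited prose passes through unchanged
            continue
        passages = " ".join(citations.get(k, "") for k in keys)
        if grounded(sentence, passages):
            kept.append(sentence)          # the citation actually supports the claim
    return " ".join(kept)
```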
Ongoing value
- Regression prevention — Every code change is tested against the full suite (CI gate sketched below)
- Model comparison — Objective data for evaluating new models
- Failure analysis — Weekly review of worst-performing cases drives improvements
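Regression prevention in CI can be as simple as a pytest gate that fails the build when aggregate scores dip below agreed thresholds. The threshold values and the run_full_suite helper below are illustrative, not the real configuration.

```python
# Regression-gate sketch, run by pytest inside the GitHub Actions job on every PR.
# Thresholds and the run_full_suite helper are illustrative.

THRESHOLDS = {
    "hallucination_rate": 0.05,   # fail the build above 5%
    "citation_precision": 0.90,
    "groundedness": 0.85,
}

def run_full_suite() -> dict:
    """Placeholder for the real entry point that runs all 200 cases and
    returns aggregate metrics keyed like THRESHOLDS."""
    raise NotImplementedError

def test_eval_suite_regression():
    metrics = run_full_suite()
    assert metrics["hallucination_rate"] <= THRESHOLDS["hallucination_rate"]
    assert metrics["citation_precision"] >= THRESHOLDS["citation_precision"]
    assert metrics["groundedness"] >= THRESHOLDS["groundedness"]
```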
Key Learnings
- You can't improve what you don't measure — Gut-feel QA missed systematic failure patterns
- Retrieval > generation — Most "hallucinations" were actually retrieval failures
- Domain-specific evals — Generic benchmarks didn't predict legal accuracy
- LLM-as-judge works — But needs calibration and cross-validation
- CI integration is essential — Evals only matter if they run automatically
Engagement type: Reliability Upgrade
Timeline: 3 weeks from kickoff to production eval suite
This case study illustrates our capabilities with a representative scenario. Details have been generalized to protect client confidentiality.
