Legal Tech · 3 weeks · Oct 2025

LLM Quality Testing Framework

Built an evaluation framework that reduced hallucinations by ~80% in a legal research product.

Reliability Upgrade · Evaluation Framework · CI/CD Integration

Key Results

  • Hallucination Reduction: 80%
  • Citation Accuracy: 94%
  • Groundedness Score: 0.91
  • Factual Accuracy: 93%

The Problem

A legal tech startup had built an AI-powered case research tool. Users could ask questions like "What are the key precedents for wrongful termination in Ontario?" and get synthesized answers with citations. The product worked—mostly. But 1 in 5 responses contained factual errors: made-up case names, incorrect dates, or citations that didn't support the claims.

For a legal product, this was existential. One wrong citation could destroy user trust and expose clients to malpractice risk.

Pain points

  • No systematic testing — QA was manual spot-checking by lawyers
  • Silent failures — Wrong answers looked just as confident as right ones
  • Model drift — Switching from GPT-4 to GPT-4-turbo introduced new failure modes
  • No baseline — "Is this better?" was answered by gut feel, not data

The Intervention

We built a comprehensive evaluation framework covering accuracy, groundedness, and citation quality:

Eval dimensions

  • Factual accuracy — Are claims true? Method: LLM-as-judge against source docs
  • Groundedness — Is every claim supported by retrieved context? Method: NLI model + citation verification (see the sketch after this list)
  • Citation precision — Do citations actually say what's claimed? Method: Extractive matching + semantic similarity
  • Completeness — Are key precedents included? Method: Golden answer comparison
  • Hallucination rate — % of responses with fabricated content. Method: Multi-model consensus check
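To make the groundedness check concrete, here is a minimal sketch of the NLI-based version. It assumes an off-the-shelf DeBERTa-v3 NLI checkpoint from Hugging Face (cross-encoder/nli-deberta-v3-base); function names like claim_is_grounded and the 0.8 threshold are illustrative, not the framework's actual API.

```python
# Sketch of a groundedness check: a claim counts as grounded if at least one
# retrieved passage entails it. Model name, labels, and threshold are assumptions.
from transformers import pipeline

nli = pipeline("text-classification", model="cross-encoder/nli-deberta-v3-base")

def claim_is_grounded(claim: str, passages: list[str], threshold: float = 0.8) -> bool:
    """Return True if any retrieved passage entails the claim."""
    for passage in passages:
        scores = nli({"text": passage, "text_pair": claim}, top_k=None)
        # scores is a list of {label, score} dicts; label names depend on the checkpoint
        entailment = next(
            (s["score"] for s in scores if s["label"].lower() == "entailment"), 0.0
        )
        if entailment >= threshold:
            return True
    return False

def groundedness_score(claims: list[str], passages: list[str]) -> float:
    """Fraction of claims supported by the retrieved context (0.0 to 1.0)."""
    if not claims:
        return 1.0
    return sum(claim_is_grounded(c, passages) for c in claims) / len(claims)
```

Checking every claim against every passage is quadratic, but at a few sentences per response and a handful of retrieved passages the cost is negligible compared to the generation call.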

Test suite architecture

Pipeline flow (a sketch of the suite's shape follows this list):

  1. Test Case (question + golden answer + source docs)
  2. RAG Pipeline processes the question
  3. Response + Citations generated
  4. Eval Suite runs all checks:
    • FactualAccuracyEval
    • GroundednessEval
    • CitationPrecisionEval
    • CompletenessEval
  5. Scores + Failure Analysis output
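In code terms, the suite is mostly plumbing: each eval scores one (test case, response) pair and the suite aggregates scores across the corpus. The class and field names below are illustrative, not the actual internal API.

```python
# Rough shape of the eval suite: each eval scores a (test case, response) pair;
# the suite collects per-case results and summarizes mean scores per eval.
from dataclasses import dataclass, field

@dataclass
class TestCase:
    question: str
    golden_answer: str
    source_docs: list[str]

@dataclass
class EvalResult:
    name: str
    score: float          # 0.0 to 1.0
    passed: bool
    details: dict = field(default_factory=dict)

class EvalSuite:
    def __init__(self, evals):
        # e.g. [FactualAccuracyEval(), GroundednessEval(), CitationPrecisionEval(), ...]
        self.evals = evals

    def run(self, case: TestCase, response: str, citations: list[str]) -> list[EvalResult]:
        return [e.evaluate(case, response, citations) for e in self.evals]

    def summarize(self, all_results: list[list[EvalResult]]) -> dict[str, float]:
        """Mean score per eval across the whole test corpus."""
        by_eval: dict[str, list[float]] = {}
        for case_results in all_results:
            for r in case_results:
                by_eval.setdefault(r.name, []).append(r.score)
        return {name: sum(scores) / len(scores) for name, scores in by_eval.items()}
```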

Test corpus

  • 200 curated questions across 8 legal domains
  • Golden answers written by practicing lawyers
  • Source documents — 500+ cases from CanLII
  • Adversarial examples — Questions designed to trigger hallucinations (one illustrative example below)
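For illustration, an adversarial test case in the shape of the TestCase sketch above might look like the following. The case citation is deliberately fictitious; that is what makes it adversarial, and the golden answer expects the system to say so rather than invent a holding.

```python
# Illustrative adversarial test case: the question presupposes a decision that
# does not exist. The citation and names here are made up for this example.
adversarial_case = TestCase(
    question=(
        "Summarize the wrongful-termination holding in "
        "Smith v. Maplewood Logistics, 2019 ONCA 0000."
    ),
    golden_answer=(
        "No such decision exists; the correct behaviour is to say the case "
        "cannot be found, not to fabricate a holding."
    ),
    source_docs=[],
)
```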

Stack

  • Eval framework: Custom Python, inspired by RAGAS and DeepEval
  • LLM judges: GPT-4 + Claude for cross-validation
  • NLI model: DeBERTa-v3 for groundedness checks
  • CI integration: GitHub Actions runs evals on every PR (a sample regression gate is sketched below)
  • Dashboard: Streamlit for exploring failures
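For the CI integration, the workflow only needs to run the suite and then a small gate script that fails the build on regressions. One plausible shape for that gate, with made-up thresholds and results path, assuming the suite writes its summary to a JSON file:

```python
# Illustrative CI gate: exit non-zero if any aggregate metric regresses past
# its threshold. Thresholds and the eval_results.json path are assumptions.
import json
import sys

THRESHOLDS = {
    "factual_accuracy": 0.90,
    "groundedness": 0.85,
    "citation_precision": 0.90,
    "hallucination_rate": 0.05,   # upper bound; lower is better
}

def main(results_path: str = "eval_results.json") -> int:
    with open(results_path) as f:
        scores = json.load(f)   # e.g. {"factual_accuracy": 0.93, ...}

    failures = []
    for metric, threshold in THRESHOLDS.items():
        value = scores.get(metric)
        if value is None:
            failures.append(f"{metric}: missing from results")
        elif metric == "hallucination_rate" and value > threshold:
            failures.append(f"{metric}: {value:.3f} > {threshold}")
        elif metric != "hallucination_rate" and value < threshold:
            failures.append(f"{metric}: {value:.3f} < {threshold}")

    if failures:
        print("Eval regression detected:\n  " + "\n  ".join(failures))
        return 1
    print("All eval metrics within thresholds.")
    return 0

if __name__ == "__main__":
    sys.exit(main(*sys.argv[1:]))
```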

The Outcome

Before/after metrics

Before → After:

  • Hallucination rate: 18% → 3.5%
  • Citation accuracy: 71% → 94%
  • Groundedness score: 0.68 → 0.91
  • Factual accuracy: 79% → 93%

What changed

  1. Retrieval improvements — Eval revealed that 60% of hallucinations stemmed from poor retrieval, not generation
  2. Prompt engineering — Added explicit "only cite if directly supports claim" instruction
  3. Post-processing — A citation-verification step removes unsupported claims before the response is returned (sketched below)
  4. Model selection — Data showed Claude 3.5 had 40% fewer hallucinations than GPT-4-turbo for this domain
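The citation-verification step from point 3 can be sketched as a filter over (sentence, citations) pairs. It takes an entailment check such as the claim_is_grounded helper above; the data shapes here are simplified assumptions, not the production interfaces.

```python
# Sketch of citation-verification post-processing: keep only sentences whose
# cited passages actually support them. Shapes are simplified for illustration.
from typing import Callable

def verify_citations(
    sentences_with_citations: list[tuple[str, list[str]]],
    passages_by_id: dict[str, str],
    is_supported: Callable[[str, list[str]], bool],
) -> list[tuple[str, list[str]]]:
    """Drop any sentence whose cited passages do not support it.

    is_supported can be the claim_is_grounded helper from the groundedness
    sketch, or any other entailment check with the same signature.
    """
    kept = []
    for sentence, citation_ids in sentences_with_citations:
        cited = [passages_by_id[c] for c in citation_ids if c in passages_by_id]
        if cited and is_supported(sentence, cited):
            kept.append((sentence, citation_ids))
    return kept
```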

Ongoing value

  • Regression prevention — Every code change tested against full suite
  • Model comparison — Objective data for evaluating new models
  • Failure analysis — Weekly review of worst-performing cases drives improvements

Key Learnings

  1. You can't improve what you don't measure — Gut-feel QA missed systematic failure patterns
  2. Retrieval > generation — Most "hallucinations" were actually retrieval failures
  3. Domain-specific evals — Generic benchmarks didn't predict legal accuracy
  4. LLM-as-judge works — But needs calibration and cross-validation (see the sketch below)
  5. CI integration is essential — Evals only matter if they run automatically
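On point 4: cross-validation can be as simple as requiring two independent judges to agree before a verdict counts, with disagreements routed to human review. A minimal, model-agnostic sketch follows; the judge functions stand in for thin wrappers around the GPT-4 and Claude APIs and are assumptions, not the framework's actual interface.

```python
# Cross-validated LLM-as-judge sketch: two independent judges must agree,
# otherwise the case is flagged for human review. Judge wrappers are assumed.
from typing import Callable

# (question, response, source_docs) -> "pass" or "fail"
Judge = Callable[[str, str, str], str]

def cross_validated_verdict(question: str, response: str, sources: str,
                            judge_a: Judge, judge_b: Judge) -> str:
    """Return a verdict only when both judges agree, else 'needs_review'."""
    verdict_a = judge_a(question, response, sources)
    verdict_b = judge_b(question, response, sources)
    return verdict_a if verdict_a == verdict_b else "needs_review"
```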

Engagement type: Reliability Upgrade
Timeline: 3 weeks from kickoff to production eval suite

This case study illustrates our capabilities with a representative scenario. Details have been generalized to protect client confidentiality.

Tech Stack

Custom Python · GPT-4 · Claude · DeBERTa-v3 · GitHub Actions · Streamlit
GTA Labs — AI consulting that ships.