LLM Quality Testing Framework
Built an evaluation framework that reduced hallucinations by ~80% in a legal research product.
Key Results
- Hallucination rate cut from 18% to 3.5%
- Citation accuracy up from 71% to 94%; groundedness score up from 0.68 to 0.91
- Production eval suite shipped in 3 weeks and runs automatically on every PR
The Problem
A legal tech startup had built an AI-powered case research tool. Users could ask questions like "What are the key precedents for wrongful termination in Ontario?" and get synthesized answers with citations. The product worked—mostly. But 1 in 5 responses contained factual errors: made-up case names, incorrect dates, or citations that didn't support the claims.
For a legal product, this was existential. One wrong citation could destroy user trust and expose clients to malpractice risk.
Pain points
- No systematic testing — QA was manual spot-checking by lawyers
- Silent failures — Wrong answers looked just as confident as right ones
- Model drift — Switching from GPT-4 to GPT-4-turbo introduced new failure modes
- No baseline — "Is this better?" was answered by gut feel, not data
The Intervention
We built a comprehensive evaluation framework covering accuracy, groundedness, and citation quality:
Eval dimensions
- Factual accuracy — Are claims true? Method: LLM-as-judge against source docs (see the sketch after this list)
- Groundedness — Is every claim supported by retrieved context? Method: NLI model + citation verification
- Citation precision — Do citations actually say what's claimed? Method: Extractive matching + semantic similarity
- Completeness — Are key precedents included? Method: Golden answer comparison
- Hallucination rate — % of responses with fabricated content. Method: Multi-model consensus check
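To make the LLM-as-judge method concrete, here is a minimal sketch of a factual-accuracy judge. It assumes the OpenAI Python client; the prompt wording, the 1-to-5 scale, and the JSON output shape are illustrative, not the production implementation.

```python
# Minimal LLM-as-judge sketch for the factual-accuracy dimension.
# Prompt wording, scoring scale, and output schema are assumptions.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading a legal research answer against its source documents.
Score factual accuracy from 1 (mostly fabricated) to 5 (fully supported) and list
any claims the sources do not support.
Return JSON only: {{"score": <int>, "unsupported_claims": ["..."]}}

Source documents:
{sources}

Answer to grade:
{answer}
"""

def judge_factual_accuracy(answer: str, sources: list[str], model: str = "gpt-4") -> dict:
    """Ask an LLM judge to grade one answer against its source documents."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic grading
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(sources="\n---\n".join(sources), answer=answer),
        }],
    )
    # The judge is asked for JSON; a production eval would validate this more defensively.
    return json.loads(response.choices[0].message.content)
```

Cross-validation with a second judge (Claude, per the stack below) follows the same pattern with a different client.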
Test suite architecture
Pipeline flow:
- Test Case (question + golden answer + source docs)
- RAG Pipeline processes the question
- Response + Citations generated
- Eval Suite runs all checks:
  - FactualAccuracyEval
  - GroundednessEval
  - CitationPrecisionEval
  - CompletenessEval
- Scores + Failure Analysis output
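A runner for this pipeline can be small. The sketch below assumes a TestCase shape matching the test corpus described next; run_suite, EvalResult, and the rag_pipeline callable are illustrative names, not the framework's real API.

```python
# Sketch of the eval-suite runner implied by the pipeline above.
# TestCase/EvalResult fields and the callable signatures are assumed names.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class TestCase:
    question: str
    golden_answer: str
    source_docs: list[str]
    adversarial: bool = False  # flags questions designed to trigger hallucinations

@dataclass
class EvalResult:
    name: str
    score: float                                   # normalized 0.0 to 1.0
    failures: list[str] = field(default_factory=list)

def run_suite(
    case: TestCase,
    rag_pipeline: Callable[[str], tuple[str, list[str]]],  # question -> (response, citations)
    evals: list,                                            # FactualAccuracyEval, GroundednessEval, ...
) -> list[EvalResult]:
    """Run the RAG pipeline on one test case, then score the response on every eval."""
    response, citations = rag_pipeline(case.question)
    return [e.evaluate(case, response, citations) for e in evals]
```

Each eval class exposes the same evaluate(case, response, citations) interface, which is what makes the failure analysis at the end of the pipeline uniform.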
Test corpus
- 200 curated questions across 8 legal domains
- Golden answers written by practicing lawyers
- Source documents — 500+ cases from CanLII
- Adversarial examples — Questions designed to trigger hallucinations
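The case study does not specify how the corpus is stored; one plausible layout is a JSONL file with one test case per line. The field names below are assumptions.

```python
# Hypothetical JSONL corpus loader; field names and file name are assumptions.
import json
from pathlib import Path

def load_corpus(path: str) -> list[dict]:
    """Load test cases from a JSONL file, one case per line.

    Expected fields (assumed): question, golden_answer, source_doc_ids,
    domain (one of the 8 legal domains), adversarial (bool).
    """
    cases = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if line.strip():
            cases.append(json.loads(line))
    return cases

# Example usage: count cases per legal domain for per-domain reporting.
# from collections import Counter
# domain_counts = Counter(c["domain"] for c in load_corpus("eval_corpus.jsonl"))
```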
Stack
- Eval framework: Custom Python, inspired by RAGAS and DeepEval
- LLM judges: GPT-4 + Claude for cross-validation
- NLI model: DeBERTa-v3 for groundedness checks (see the sketch after this list)
- CI integration: GitHub Actions runs evals on every PR
- Dashboard: Streamlit for exploring failures
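As an illustration of the groundedness check, the sketch below uses a Hugging Face DeBERTa-v3 NLI cross-encoder to test whether a cited passage entails a claim. The specific checkpoint and the 0.8 threshold are assumptions; the case study only names "DeBERTa-v3".

```python
# Groundedness sketch: score whether the cited context entails each claim.
# The checkpoint and threshold are assumptions, not the client's exact setup.
from transformers import pipeline

nli = pipeline(
    "text-classification",
    model="MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli",  # assumed DeBERTa-v3 NLI checkpoint
)

def claim_is_grounded(claim: str, context: str, threshold: float = 0.8) -> bool:
    """Treat a claim as grounded if the NLI model says the context entails it."""
    # For pair classification the pipeline accepts {"text": premise, "text_pair": hypothesis}.
    scores = nli({"text": context, "text_pair": claim}, top_k=None)
    by_label = {s["label"].lower(): s["score"] for s in scores}
    return by_label.get("entailment", 0.0) >= threshold
```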
The Outcome
Before/after metrics
Before → After:
- Hallucination rate: 18% → 3.5%
- Citation accuracy: 71% → 94%
- Groundedness score: 0.68 → 0.91
- Factual accuracy: 79% → 93%
What changed
- Retrieval improvements — Eval revealed that 60% of hallucinations stemmed from poor retrieval, not generation
- Prompt engineering — Added explicit "only cite if directly supports claim" instruction
- Post-processing — A citation verification step removes unsupported claims before the response is returned (sketched below)
- Model selection — Data showed Claude 3.5 had 40% fewer hallucinations than GPT-4-turbo for this domain
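The post-processing step above can be approximated with a simple filter: split the response into sentences, look up each sentence's citations, and drop sentences the cited passages do not entail. The regex-based splitting, the "[n]" citation format, and the grounded callable are simplifications, not the production code.

```python
# Post-processing sketch: drop sentences whose citations do not support them.
# Sentence splitting, the "[n]" citation format, and the lookup shape are assumptions.
import re
from typing import Callable

def strip_unsupported_claims(
    response: str,
    citations: dict[str, str],             # e.g. "[3]" -> quoted source passage
    grounded: Callable[[str, str], bool],  # e.g. claim_is_grounded from the NLI sketch
) -> str:
    """Keep only sentences whose cited passages support them."""
    kept = []
    for sentence in re.split(r"(?<=[.!?])\s+", response.strip()):
        keys = re.findall(r"\[\d+\]", sentence)
        if not keys:
            kept.append(sentence)          # uncited prose passes through unchanged
            continue
        passages = " ".join(citations.get(k, "") for k in keys)
        if grounded(sentence, passages):
            kept.append(sentence)          # the citation actually supports the claim
    return " ".join(kept)
```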
Ongoing value
- Regression prevention — Every code change is tested against the full suite (CI gate sketched below)
- Model comparison — Objective data for evaluating new models
- Failure analysis — Weekly review of worst-performing cases drives improvements
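Regression prevention in CI can be as simple as a pytest gate that fails the build when aggregate scores dip below agreed thresholds. The threshold values and the run_full_suite helper below are illustrative, not the real configuration.

```python
# Regression-gate sketch, run by pytest inside the GitHub Actions job on every PR.
# Thresholds and the run_full_suite helper are illustrative.

THRESHOLDS = {
    "hallucination_rate": 0.05,   # fail the build above 5%
    "citation_precision": 0.90,
    "groundedness": 0.85,
}

def run_full_suite() -> dict:
    """Placeholder for the real entry point that runs all 200 cases and
    returns aggregate metrics keyed like THRESHOLDS."""
    raise NotImplementedError

def test_eval_suite_regression():
    metrics = run_full_suite()
    assert metrics["hallucination_rate"] <= THRESHOLDS["hallucination_rate"]
    assert metrics["citation_precision"] >= THRESHOLDS["citation_precision"]
    assert metrics["groundedness"] >= THRESHOLDS["groundedness"]
```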
Key Learnings
- You can't improve what you don't measure — Gut-feel QA missed systematic failure patterns
- Retrieval > generation — Most "hallucinations" were actually retrieval failures
- Domain-specific evals — Generic benchmarks didn't predict legal accuracy
- LLM-as-judge works — But needs calibration and cross-validation
- CI integration is essential — Evals only matter if they run automatically
Engagement type: Reliability Upgrade
Timeline: 3 weeks from kickoff to production eval suite
This case study illustrates our capabilities with a representative scenario. Details have been generalized to protect client confidentiality.
