Unit tests for legal AI: why your agent needs a CI eval pipeline

You ship a legal AI agent. It works well in testing. Three weeks later, a prompt template gets updated, a model version changes, or a new document type enters the corpus. Nobody notices. The agent keeps responding. The answers are still fluent.
But citation accuracy has dropped 20%. The agent is now occasionally citing the wrong clause. It is refusing questions it should answer. It is summarizing obligations that are not there.
Unlike a crashing API, silent degradation does not raise an error. It quietly erodes trust until a lawyer catches a wrong answer in a high-stakes review.
Why standard LLM evals are not enough
General-purpose LLM evaluation frameworks measure things like answer correctness, toxicity, and response coherence. Those matter. But for legal AI specifically, they miss the failure modes that actually break production workflows.
Legal AI fails in ways that general evals do not cover:
Citation failure: the answer is correct but the source reference is wrong or missing
Clause drift: the model paraphrases an obligation in a way that changes its legal meaning
Hallucinated facts: a date, a party name, or a penalty amount that is not in the document
Wrong refusals: the agent declines to answer a question that is clearly in scope
Cross-document confusion: the agent pulls a clause from the wrong contract in a multi-document session
These are the failures that matter in legal workflows. They require evaluation metrics built specifically for legal context.
What MicroEvals are
MicroEvals are fast, repeatable behavioral tests designed to run in CI. Think of them the way you think about unit tests for code. They do not check everything. They check the specific behaviors that must not regress.
A MicroEval run takes a small, curated dataset (30 to 40 samples is enough to start), runs your agent against it, and scores outputs against a defined set of legal-specific metrics. The whole run is fast enough to sit inside a GitHub Actions workflow without blocking your pipeline for long.
The goal is not a comprehensive benchmark. It is a quality gate. If citation accuracy drops below your threshold, the run fails and you know before production.
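The gate logic is simple enough to sketch. The `Sample` shape, the substring-based citation check, and the 0.9 threshold below are illustrative assumptions, not MicroEvals' actual schema:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Sample:
    document: str
    question: str
    expected_citation: str  # e.g. "Section 4.2" (hypothetical gold label)

def run_microeval(agent: Callable[[str, str], str],
                  samples: list[Sample],
                  threshold: float = 0.9) -> bool:
    """Run the agent over a small dataset and gate on citation accuracy."""
    hits = 0
    for s in samples:
        answer = agent(s.document, s.question)
        if s.expected_citation in answer:  # crude stand-in for a real citation match
            hits += 1
    accuracy = hits / len(samples)
    print(f"citation accuracy: {accuracy:.2%}")
    return accuracy >= threshold  # False means the CI gate should fail
```

Thirty or forty samples through a loop like this takes seconds of scoring time; the model calls dominate, which is why the datasets stay small.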
Key metrics explained
Faithfulness Rate
Is the answer fully supported by the retrieved text? This catches hallucination at the grounding level, not just at the output level.
Citation Accuracy
Does the answer reference the correct clause, page, and passage? This is the legal-specific failure mode that general evals ignore entirely.
Refusal Correctness
When the answer is not in the document, does the agent correctly decline rather than invent something? And when the answer is present, does it avoid refusing unnecessarily?
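Both directions of this check can be scored per sample. A minimal sketch, assuming the eval sample carries a gold `answerable` flag; the marker list is a simplistic stand-in for a proper refusal classifier:

```python
# Hypothetical refusal phrases; a real metric would use a classifier, not substrings.
REFUSAL_MARKERS = ("not in the document", "cannot find", "unable to answer")

def refusal_correct(answer: str, answerable: bool) -> bool:
    """True when the agent refuses exactly when it should:
    refuse if the question is unanswerable, answer if it is answerable."""
    refused = any(m in answer.lower() for m in REFUSAL_MARKERS)
    return refused != answerable
```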
Clause Coverage Score
For questions that require citing multiple clauses, does the agent surface all of them or just the first one it finds?
Hallucination Rate
Does the model introduce facts, entities, dates, or obligations not present in the source document?
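One cheap proxy for this metric is to flag numbers (dates, amounts, penalties) in the answer that never appear in the source. This is a sketch of the idea only; a production metric would also check entities and paraphrased obligations:

```python
import re

NUM = re.compile(r"\d(?:[\d,.]*\d)?")  # numbers, allowing internal commas/periods

def hallucinated_numbers(answer: str, source: str) -> list[str]:
    """Numbers in the answer with no occurrence in the source document."""
    answer_nums = set(NUM.findall(answer))
    source_nums = set(NUM.findall(source))
    return sorted(answer_nums - source_nums)

def hallucination_rate(results: list[tuple[str, str]]) -> float:
    """Fraction of (answer, source) pairs containing at least one ungrounded number."""
    flagged = sum(1 for a, s in results if hallucinated_numbers(a, s))
    return flagged / len(results)
```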
Full evals vs MicroEvals: when to use which
MicroEvals are for CI. They run on every push, every PR, every deployment. They are fast and focused on regression, not discovery.
Full evals are for benchmarking. They run less frequently and give you a broader picture of where your legal AI stands relative to the market. LexStack keeps full evals private and MicroEvals open source, because the use cases are different: full evals tell you where you stand, MicroEvals stop you from slipping backwards.
How to plug MicroEvals into your pipeline
MicroEvals are open source and framework-agnostic. Your agent file needs to expose one function:
def run_agent(document: str, question: str) -> str
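Wrapping your own stack behind that entry point might look like the sketch below. The retrieval body is a placeholder, not part of MicroEvals; swap in your actual RAG pipeline and LLM call:

```python
# my_agent.py -- adapter exposing the entry point MicroEvals expects.

def run_agent(document: str, question: str) -> str:
    """Answer `question` using only `document` as context."""
    # Placeholder retrieval: pick the paragraph sharing the most words
    # with the question. A real agent would embed, retrieve, and prompt.
    paragraphs = [p for p in document.split("\n\n") if p.strip()]
    q_words = set(question.lower().split())
    best = max(paragraphs, key=lambda p: len(q_words & set(p.lower().split())))
    # A real agent would now call an LLM with `best` as grounded context.
    return best
```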
Then point the CLI at your agent and a dataset:
python cli.py my_agent.py nda_basic
The CLI loads the dataset, calls your agent on each sample, calculates metrics, and prints a scored report. Drop this into a GitHub Actions step and set thresholds that fail the build if scores fall below acceptable levels.
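The threshold step can be a small script the workflow runs after the eval. The report file name, metric keys, and threshold values below are illustrative assumptions, not MicroEvals' actual output format:

```python
# check_thresholds.py -- fail the build when eval scores regress.
import json

THRESHOLDS = {
    "citation_accuracy": 0.90,
    "faithfulness_rate": 0.95,
    "hallucination_rate": 0.05,  # upper bound: lower is better
}

def check(report: dict) -> list[str]:
    """Return a list of human-readable threshold violations (empty = pass)."""
    failures = []
    for metric, limit in THRESHOLDS.items():
        score = report[metric]
        ok = score <= limit if metric == "hallucination_rate" else score >= limit
        if not ok:
            failures.append(f"{metric}: {score} (threshold {limit})")
    return failures

def main(path: str) -> int:
    failures = check(json.load(open(path)))
    if failures:
        print("Eval gate failed:\n" + "\n".join(failures))
        return 1  # a non-zero exit code fails the GitHub Actions step
    return 0

# In the workflow step: raise SystemExit(main("report.json"))
```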
The datasets in LexStack cover NDA basics, CUAD contract clause extraction, and MAUD merger agreement conditionality clauses. You can also run RAGAS synthetic test set generation on your own contracts to build a custom dataset in minutes.
Why this matters now
Legal AI is moving from prototype to production. Teams that ship legal agents without evaluation infrastructure are accumulating technical debt they cannot see. The agent keeps working. The quality keeps slipping. Nobody knows until something goes wrong.
MicroEvals is the layer that makes legal AI shippable with confidence. Not because it catches everything, but because it catches the things that matter on every single deploy.
Getting started
MicroEvals is open source and part of the LexStack infrastructure. The repo is live and the CLI is ready to use.
Repo: LexReviewer
If you run MicroEvals through the LexStack infrastructure, each eval run deducts from your prepaid credits. No subscription required to start.
LexStack is open-source infrastructure for legal AI. It includes LexReviewer for document RAG, Law MCP for structured legal tools, and MicroEvals for CI-native evaluation.