Every metric that matters when evaluating a legal AI system



Building a legal AI system without evaluation metrics is like shipping code without tests. You think it works. You have not proven it.

The difference in legal AI is that the consequences of not knowing are higher. When a system incorrectly summarises an obligation, cites the wrong clause, or hallucinates a penalty amount, it does not throw an exception. It just produces a wrong answer that looks right — and that wrong answer might end up in a contract review, a compliance report, or a legal brief.

This post covers every metric that matters, grouped by the layer of the system each one measures, with a clear explanation of when each is worth tracking.

You cannot improve what you cannot measure. In legal AI, you also cannot trust what you have not measured.

Tier 1: Retrieval and grounding metrics

These measure whether your system is finding the right content and anchoring its answers in it. Failures here are foundational — if retrieval is broken, everything downstream is unreliable.

Faithfulness Rate

Measures whether the answer is fully supported by the retrieved content. A high faithfulness rate means the model is not inventing information beyond what it found. This is the primary guard against hallucination at the output level.

When to use it: always. This should be in every evaluation run.
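
In practice, faithfulness is usually scored by an LLM judge, but a crude lexical proxy makes the idea concrete. The sketch below (names and threshold are illustrative, not from any particular framework) counts a claim as supported when most of its content words appear in the retrieved context:

```python
import re

def _words(text):
    return re.findall(r"[a-z]+", text.lower())

def faithfulness_rate(claims, context, threshold=0.6):
    """Fraction of answer claims whose content words mostly
    appear in the retrieved context. A lexical stand-in for an
    LLM-judge faithfulness check."""
    context_words = set(_words(context))
    supported = 0
    for claim in claims:
        words = [w for w in _words(claim) if len(w) > 3]
        if words and sum(w in context_words for w in words) / len(words) >= threshold:
            supported += 1
    return supported / len(claims)
```

Real evaluation frameworks replace the word-overlap test with an entailment judgment per claim; the shape of the metric stays the same.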

Hallucination Rate

Specifically tracks instances where the model introduces facts, entities, dates, or obligations that are not present in the source document. Faithfulness and hallucination rate are related but distinct — faithfulness measures support, hallucination rate measures invention.

When to use it: whenever your system handles structured legal data like party names, dates, monetary amounts, or jurisdiction-specific rules.

Answer Relevancy

Does the answer actually address what was asked? A system can be faithful to its retrieved content and still give an irrelevant answer if retrieval pulled the wrong passages. This metric catches misalignment between the question and the response.

When to use it: particularly useful when questions are complex or multi-part.

Context Precision

Of the retrieved passages, how many were actually relevant to the question? Low context precision means your retrieval is noisy — it is pulling too much irrelevant content and forcing the LLM to work harder to find the signal.

When to use it: when you are debugging retrieval quality or comparing retrieval strategies.
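
Given relevance labels per passage, the computation is a one-liner worth pinning down. A sketch assuming passages are identified by IDs:

```python
def context_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved passages that are actually relevant
    to the question."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    return sum(pid in relevant for pid in retrieved_ids) / len(retrieved_ids)
```

A score of 0.5 means half of what you retrieve is noise the LLM has to filter out on its own.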

Citation Accuracy

Does the answer reference the correct clause, page, and passage? This is the legal-specific metric that general evaluation frameworks almost always skip. A system can be faithful and relevant but still cite the wrong clause — and in legal workflows, the citation is often the deliverable.

When to use it: any system that surfaces source references to end users.
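
One simple scoring rule, sketched below under the assumption that answers cite clause IDs: an answer's citations count as accurate only when it cites at least one clause and every cited clause is genuinely correct.

```python
def citation_accuracy(examples):
    """examples: list of (cited_ids, correct_ids) pairs.
    An answer passes only if it cites something and everything
    it cites is in the ground-truth set."""
    hits = sum(
        len(cited) > 0 and set(cited) <= set(correct)
        for cited, correct in examples
    )
    return hits / len(examples)
```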

Citation Recall

When multiple clauses support an answer, does the system surface all of them or just the first one it finds? Low citation recall means users are getting incomplete pictures of their legal position.

When to use it: contract review, due diligence, compliance checking — any workflow where partial answers carry risk.
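
The complementary recall computation, again assuming clause IDs as the unit of citation:

```python
def citation_recall(cited_ids, supporting_ids):
    """Of all clauses that support the answer, what fraction
    did the system actually cite?"""
    supporting = set(supporting_ids)
    if not supporting:
        return 1.0  # nothing to cite, nothing missed
    return len(supporting & set(cited_ids)) / len(supporting)
```

Citing one of three supporting clauses scores 0.33: faithful, accurate, and still an incomplete picture.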

Refusal Correctness

Two-sided metric. When the answer is not in the document, does the system correctly decline rather than hallucinate? And when the answer is present, does it avoid unnecessary refusals? Both failure modes matter.

When to use it: any RAG system deployed to end users who will trust its scope.
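
Because the metric is two-sided, it is clearest as a pair of numbers rather than one. A sketch over a labelled test set (the case tuple format is an assumption for illustration):

```python
def refusal_correctness(cases):
    """cases: list of (answer_in_doc: bool, system_refused: bool).
    Returns (correct_refusal_rate, over_refusal_rate): the first
    should be high, the second low."""
    unanswerable = [r for a, r in cases if not a]
    answerable = [r for a, r in cases if a]
    correct = sum(unanswerable) / len(unanswerable) if unanswerable else 1.0
    over = sum(answerable) / len(answerable) if answerable else 0.0
    return correct, over
```

Collapsing the two rates into one score hides which failure mode you have, so keep them separate.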

Clause Coverage Score

For questions that require synthesising multiple clauses, does the answer account for all of them? A system with high faithfulness but low clause coverage is giving partial answers confidently — which is arguably worse than an explicit partial answer.

When to use it: complex contract queries, multi-clause obligation analysis.

Missing Clause Detection Rate

Specifically measures whether the system catches relevant clauses it failed to retrieve. This is retrieval recall viewed from the evaluation side — useful for identifying systematic gaps in what your retrieval pipeline surfaces.

When to use it: high-stakes document review where missing a clause has legal consequences.

Tier 2: Extraction and data integrity metrics

These measure precision on structured extraction tasks — pulling specific values, spans, and entities from legal text accurately.

Clause Span IoU

Intersection over Union for extracted clause boundaries. When your system identifies a clause, how closely does the extracted span match the ground truth span? Critical for systems that highlight or reference specific text regions.

When to use it: whenever bounding box or character-span accuracy matters to your UI or downstream pipeline.
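
For character spans the computation is the standard IoU, sketched here with (start, end) offsets, end exclusive:

```python
def span_iou(pred, gold):
    """Intersection-over-union of two character spans,
    each given as (start, end) with end exclusive."""
    inter = max(0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union else 0.0
```

A prediction offset by half a clause scores around 0.33, which a typical 0.5 threshold would correctly reject.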

Entity Extraction Accuracy

How accurately does the system identify named parties, organisations, and legal entities? Legal documents often have complex party structures — parent companies, subsidiaries, defined terms — and entity errors compound downstream.

When to use it: contract data extraction, due diligence, counterparty analysis.

Monetary Extraction Accuracy

Payment amounts, penalties, caps, and thresholds are some of the most consequential values in a contract. This metric tracks whether the system extracts them correctly, including currency, units, and conditions.

When to use it: any system that surfaces financial terms.

Date Normalisation Accuracy

Legal documents express dates in many formats. This measures whether the system normalises them correctly to a standard format — important for downstream date logic, deadline calculation, and timeline analysis.

When to use it: contract management, deadline tracking, event-triggered obligation systems.
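
A minimal normaliser can be built on strptime with a list of formats actually seen in your corpus (the formats below are illustrative assumptions):

```python
from datetime import datetime

# Extend this list with the formats that appear in your documents.
FORMATS = ["%d %B %Y", "%B %d, %Y", "%d/%m/%Y", "%Y-%m-%d"]

def normalise_date(text):
    """Return the ISO 8601 form of a date string, or None if no
    known format matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

def date_normalisation_accuracy(pairs):
    """pairs: list of (raw_text, expected_iso_date)."""
    return sum(normalise_date(raw) == gold for raw, gold in pairs) / len(pairs)
```

Note that ambiguous numeric formats (03/04/2025) are jurisdiction-dependent; the format list encodes that assumption explicitly rather than guessing per document.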

Obligation Classification Accuracy

When classifying clauses as obligations, rights, or restrictions, how accurate is the system? Misclassifying a restriction as a right is the kind of error that survives review and causes problems later.

When to use it: contract analysis systems that output structured obligation data.

Tier 3: Legal logic and professional trust metrics

These are the hardest metrics to measure and the most important for systems used by legal professionals. They evaluate whether the system reasons correctly in a legal context, not just whether it retrieves and extracts accurately.

Legal Completeness

Does the answer contain all the elements a legally complete response requires? A system can give a technically correct but dangerously incomplete answer — for example, answering whether a termination right exists without surfacing the notice requirements attached to it.

When to use it: any system deployed in actual legal workflows.

Risk Classification F1

When detecting risky or unusual clauses — liability caps below market standard, unilateral amendment rights, broad indemnification — how accurately does the system flag them? This requires a ground truth dataset of labelled risky clauses.

When to use it: contract review tools, risk flagging systems, due diligence automation.
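
With a labelled set of risky clause IDs, the F1 itself is the textbook computation:

```python
def risk_f1(predicted, actual):
    """F1 over the sets of clause IDs flagged as risky versus
    the ground-truth risky clauses."""
    pred, act = set(predicted), set(actual)
    tp = len(pred & act)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(act)
    return 2 * precision * recall / (precision + recall)
```

The hard part is not this function but the labelled dataset behind it: "below market standard" is a judgment call that needs consistent annotation guidelines.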

Ambiguity Handling Score

Legal clauses are often genuinely ambiguous. This metric evaluates whether the system correctly identifies ambiguous language and surfaces the multiple interpretations rather than collapsing to one. A system that always gives a confident single answer on ambiguous clauses is a liability.

When to use it: contract drafting support, interpretation assistance.

Conflict Sensitivity Score

Measures whether the system identifies and flags contradictory information within the retrieved context — for example, two clauses that impose conflicting obligations. This is especially important for multi-document retrieval where conflicts across documents are common.

When to use it: multi-document contract review, amendment analysis.

Reasoning Faithfulness

Even when a final answer is correct, the reasoning path that produced it might be legally unsound. This metric evaluates whether the logic the LLM used is valid, not just whether the conclusion happens to be right. A system that reaches right answers through wrong reasoning will eventually reach wrong answers.

When to use it: high-trust deployments where reasoning is surfaced to legal professionals.

Misleading Answer Rate

Tracks answers that are plausible and confident but legally wrong — the most dangerous failure mode. Unlike hallucination rate (which catches invented facts), misleading answer rate catches wrong legal conclusions drawn from real retrieved content.

When to use it: any production legal AI system.

How to build evaluation coverage in practice

Not every metric belongs in every evaluation run. The practical approach is to layer them:

  • In CI (every deploy): faithfulness rate, citation accuracy, refusal correctness, hallucination rate. Fast to compute, catches the regressions that matter most.

  • In weekly eval runs: add clause coverage, context precision, entity extraction accuracy. Broader picture of system quality.

  • Before major releases: full tier 3 metrics. These require more careful dataset construction and human review but give you the legal completeness picture.

LexStack MicroEvals covers the CI layer out of the box. The metrics, datasets, and CLI are open source and designed to plug into your existing pipeline without rebuilding evaluation infrastructure from scratch.

Repo: github.com/LexStack-AI/LexReviewer


LexStack is open-source infrastructure for legal AI. It includes LexReviewer for document RAG, Law MCP for structured legal tools, and MicroEvals for CI-native evaluation.