How hybrid retrieval works — and why legal AI needs both vector search and BM25

When you build a RAG system, one of the first decisions you make is how to retrieve relevant content. Most teams default to vector search — embed the query, find the nearest neighbours in the vector space, return the top results. It works well enough that you ship it and move on.
For legal documents, that decision creates a systematic blind spot.
Legal text is not like other text. It mixes natural language reasoning with precise, structured identifiers. A single document might contain both a plain-language explanation of a concept and a specific defined term with an exact section reference. Vector search handles the first type well. It handles the second type poorly. BM25 does the opposite.
Semantic-only retrieval misses exact clause references. Keyword-only retrieval misses semantic intent. Legal documents require both, every time.
How vector search works
Vector search converts both documents and queries into dense numerical representations (embeddings) that capture semantic meaning. Documents with similar meaning cluster together in the embedding space, regardless of whether they share the same words.
When a query comes in, it gets embedded into the same space. The retrieval system finds the documents (or chunks) whose embeddings are closest to the query embedding using a distance metric, typically cosine similarity.
This is powerful for natural language questions. A query like 'what happens if the supplier fails to deliver on time' will correctly surface clauses about delivery failures even if those clauses use different vocabulary — 'non-performance', 'breach of delivery obligation', 'failure to meet SLA'.
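The retrieval step above can be sketched in a few lines. This is a toy illustration, not LexReviewer's implementation: the three-dimensional vectors stand in for real embedding-model output, and the document IDs are invented.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def vector_search(query_vec, doc_vecs, top_k=2):
    """Rank chunks by cosine similarity to the query embedding."""
    ranked = sorted(doc_vecs, key=lambda d: cosine_similarity(query_vec, doc_vecs[d]), reverse=True)
    return ranked[:top_k]

# Toy 3-dimensional embeddings standing in for real model output.
docs = {
    "delivery_failure_clause": [0.9, 0.1, 0.0],
    "payment_terms_clause":    [0.1, 0.9, 0.0],
    "ip_assignment_clause":    [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.1]  # embedding of e.g. "supplier fails to deliver on time"
print(vector_search(query, docs))
```

A clause about "non-performance" would sit near this query in the real embedding space even though the query never uses that word — that is the whole appeal.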
Where vector search breaks on legal text
The problem is that queries against legal documents are frequently not semantic at all. Consider:
'Section 4.2(b)' — a direct clause reference
'Schedule 1' — a reference to an attached document
'Intellectual Property Rights as defined in clause 1.1' — a defined term reference
'LIBOR plus 2.5% per annum' — a precise financial term
Vector search will not reliably retrieve these. The embedding of 'Section 4.2(b)' does not cluster meaningfully near the actual content of Section 4.2(b) — it clusters near other section references. The semantic signal is too weak.
In legal workflows, exact-phrase queries are not edge cases. They are a significant proportion of real usage. Contract reviewers ask about specific clauses by reference all the time. A system that silently fails on these queries loses trust fast.
How BM25 works
BM25 (Best Match 25) is a classical information retrieval algorithm. It ranks documents by how often query terms appear in each document, normalised by document length and with diminishing returns as a term repeats (term-frequency saturation).
In plain terms: BM25 finds documents that contain the words in your query, weighted by how distinctive and frequent those words are. It is fast, interpretable, and excellent at exact-phrase and keyword lookup.
BM25 score = sum over query terms of: IDF(term) * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
IDF (Inverse Document Frequency) gives higher weight to rare terms. TF (Term Frequency) captures how often the term appears in the document. The k1 and b parameters control saturation and length normalisation.
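The formula above can be sketched directly. A minimal version, assuming one common IDF variant (the article does not specify which) and illustrative tokenised documents:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenised document against the query using the BM25 formula above."""
    N = len(docs)
    avg_doc_len = sum(len(d) for d in docs) / N
    # Document frequency: in how many documents each query term appears.
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * (tf[t] * (k1 + 1)) / (tf[t] + k1 * (1 - b + b * len(doc) / avg_doc_len))
        scores.append(score)
    return scores

docs = [
    "section 4.2 b termination for convenience".split(),
    "schedule 1 list of deliverables".split(),
    "clause on late payment penalties".split(),
]
scores = bm25_scores("section 4.2".split(), docs)
print(scores)
```

Only the first document contains the literal tokens "section" and "4.2", so it scores highest — exactly the exact-phrase behaviour vector search lacks.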
Where BM25 breaks on legal text
BM25 has no understanding of meaning. It matches words, not concepts. A query like 'what are the consequences of missing a payment deadline' will not reliably surface clauses about 'late payment penalties' unless those exact words appear.
Legal language is also highly specialised. The same concept appears in many forms across different contracts and jurisdictions. BM25 alone will miss synonymous concepts, paraphrased obligations, and semantically equivalent clauses that use different vocabulary.
Why combining them changes the accuracy picture
Hybrid retrieval runs both systems on every query and merges the results. The merge step is where the design decisions live.
In LexReviewer, Qdrant handles vector search and rank-BM25 handles keyword retrieval. Results from both are merged using a reciprocal rank fusion approach — results that appear in both lists get boosted, results that appear in only one list are still surfaced but ranked lower.
The effect is that the final result set covers both failure modes:
Semantic queries: vector search finds conceptually related clauses even with different vocabulary.
Exact-phrase queries: BM25 finds precise clause references, defined terms, and structured identifiers.
Overlapping queries: clauses that satisfy both conditions get ranked higher, increasing precision.
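The merge step described above can be sketched as a standard reciprocal rank fusion. This is an illustration of the technique, not LexReviewer's code; the k constant of 60 is the conventional default from the RRF literature and the clause IDs are invented:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked result lists: each item scores the sum of 1/(k + rank) over every list it appears in."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["clause_12", "clause_7", "clause_3"]  # semantic matches
bm25_hits   = ["clause_7", "clause_9"]               # keyword matches
merged = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(merged)
```

clause_7 appears in both lists, so its two rank contributions sum and it rises to the top; results from only one list still surface, just lower down.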
Hybrid retrieval does not just add coverage. It improves precision on the queries where you need it most — the ones where being wrong has legal consequences.
The role of chunk metadata
Retrieval accuracy is only half of the picture. In legal AI, you also need to know where in the document the retrieved content lives — not just to cite it, but to provide bounding boxes for UI highlighting.
LexReviewer uses Unstructured.io for chunking, which preserves layout metadata including page numbers, section headers, and bounding box coordinates. This metadata travels with the chunk through the entire pipeline — from indexing into Qdrant to storage in MongoDB to surfacing in the streamed response.
When a citation is returned with reference_positions, it contains enough information to highlight the exact passage in the source PDF. That is not possible without layout-aware chunking that preserves spatial metadata from the start.
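A chunk record of the kind described above might look like the following. The field names here are assumptions for illustration, not LexReviewer's actual schema:

```python
# Illustrative chunk record; field names are hypothetical, not LexReviewer's schema.
chunk = {
    "text": "The Supplier shall deliver the Goods by the Delivery Date...",
    "page_number": 4,
    "section_header": "5. Delivery",
    "bbox": {"x0": 72.0, "y0": 310.5, "x1": 523.4, "y1": 388.2},  # PDF points
}

def to_reference_position(chunk):
    """Carry layout metadata through to a citation payload for UI highlighting."""
    return {"page": chunk["page_number"], "bbox": chunk["bbox"]}

print(to_reference_position(chunk))
```

The point is that the page number and coordinates must be captured at chunking time and carried unchanged through indexing and storage — they cannot be reconstructed afterwards.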
Tuning hybrid retrieval for your use case
The merge weights between vector and BM25 results are a tunable parameter. The right balance depends on your query distribution:
If your users mostly ask semantic questions, weight vector search higher.
If your users frequently reference specific clauses by ID, weight BM25 higher.
If you are unsure, equal weighting with reciprocal rank fusion is a solid default.
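One way to express this tuning is a weighted variant of reciprocal rank fusion, where each retriever's contribution is scaled before summing. A sketch under those assumptions (weights and IDs are illustrative):

```python
def weighted_rrf(vector_hits, bm25_hits, w_vector=1.0, w_bm25=1.0, k=60):
    """Reciprocal rank fusion with per-retriever weights, so the merge can favour one strategy."""
    scores = {}
    for weight, ranked in ((w_vector, vector_hits), (w_bm25, bm25_hits)):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Favour BM25 for a clause-heavy query distribution: its unique hit now
# outranks vector search's unique hit, while the shared hit stays on top.
ranking = weighted_rrf(
    ["clause_3", "clause_7"], ["clause_4.2b", "clause_7"],
    w_vector=0.5, w_bm25=1.5,
)
print(ranking)
```

With equal weights this reduces to plain RRF, which is why equal weighting is the sensible starting point before an eval suite tells you otherwise.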
You can evaluate the impact of different weighting strategies using MicroEvals — run your eval suite with context precision as the primary metric and compare the scores across configurations. That gives you an empirical basis for the tuning decision rather than a guess.
A note on embedding model choice
Vector retrieval quality depends significantly on the embedding model. LexReviewer uses OpenAI's text-embedding-3-large (3072 dimensions) by default. For legal text specifically, higher-dimensional embeddings help — legal language is precise, and the embedding space needs enough capacity to separate similar but legally distinct concepts.
If you are evaluating alternative embedding models, use citation accuracy and context precision from your MicroEvals suite as the benchmark. Those metrics will surface retrieval quality differences that general benchmarks miss.
Getting started
Hybrid retrieval is built into LexReviewer out of the box. You do not need to configure it separately — upload a document, start querying, and both retrieval strategies run automatically on every request.
Repo: github.com/LexStack-AI/LexReviewer
LexStack is open-source infrastructure for legal AI. It includes LexReviewer for document RAG, Law MCP for structured legal tools, and MicroEvals for CI-native evaluation.