What we learned building a 100M-document search engine, part 3: Hybrid search depends on query type
We tested hybrid search across document-grounded query regimes at 100 million documents. The best merge strategy changed with query type, and the evaluation data had to be cleaned before the results meant anything.
Elisabeth Koren Halvorsen & Janne Beate Bakeng · 10 min read
This is Part 3 of a three-part series on building and evaluating large-scale retrieval.
Part 1 covered the operational reality of working at this scale. Part 2 covered ANN tuning and the interaction between embeddings, index parameters, and query behavior. This post looks at hybrid search: what happens when lexical and semantic retrieval are combined, and why the answer depends on the type of query being issued.
For our corpus we found that query type strongly affects which merge algorithm works best. Reciprocal rank fusion was a strong default across query datasets, but weighted linear normalization performed better when the lexical and semantic weights were tuned to the query type.
Why hybrid search helps
Hybrid search combines two retrieval signals: lexical search and semantic search.
Lexical search matches the words in the query against the words in the document. It works well for names, identifiers, exact phrases, error messages, product codes, and other terms where wording matters. But it struggles when the relevant document uses different words than the query.
Semantic search compares meaning rather than exact wording. It works well when the query and document describe the same idea with different wording. But it can struggle when the query is vague, when many documents are topically similar, or when exact terms are important.
This means neither method is universally better. Their usefulness depends on the interaction between the corpus and the query.
Short, keyword-like queries often benefit from lexical retrieval because individual terms may carry much of the query intent, especially when they are rare, domain-specific, or require exact matching. Longer, more descriptive queries can provide semantic retrieval models with more contextual information, which may help when relevant documents use different wording from the query. However, query length is only a rough indicator: short queries can be ambiguous and benefit from semantic matching, while long queries may still contain important exact-match terms.
Hybrid search is useful because it lets both signals capture complementary evidence. The hard part is not deciding whether to combine lexical and semantic retrieval. The hard part is how to combine them for a particular query, corpus, and task.
Generating queries we could actually evaluate
In Part 2, MIMICS gave us realistic human queries, but it has no ground-truth mapping to our corpus, so we had no objective way to measure retrieval quality. For this part, we needed document-grounded queries: queries generated from known documents, so we could check whether each query's source document appeared in the retrieved results.
Our setup has a caveat. The source document is assumed to be relevant because the query was generated from it, but that does not guarantee it is the best match for the query: another document in the corpus may fit the query equally well or better. The metric is still useful; it is just not a perfect relevance judgment.
We sampled roughly 1,000 random documents from the 100 million document corpus and used an LLM to generate three query types from the same source documents:
Agentic queries were longer, more specific, and semantically focused. The prompt allowed synonyms and paraphrases. These were meant to resemble the explicit information requests an agent might send to a retrieval layer.
All-words-included queries used a similar information-seeking style, but every query word had to appear in the source document. This created high lexical overlap by design.
Human-style queries were short, fragmentary, and underspecified. They were shaped to resemble what a human would search for.
Here is a simplified illustrative example.
Source document:
In Q3, the support team reviewed 1,240 customer tickets across chat, email, and phone. The most common issue was delayed password reset emails, especially for users with Outlook and Yahoo accounts. A second major issue involved two-factor authentication codes arriving too late during peak evening hours. The report traced the password reset issue to a mail delivery queue bottleneck and duplicate annual renewal charges to a payment processor retry error.
Agentic query:
Find the main operational issues in Q3 customer support, their causes, affected user groups, trends in response performance, and recommended actions.
The three queries point at the same underlying document, but they give the retrieval system different evidence. The agentic query carries more semantic structure. The human-style query carries less context. The all-words-included query gives lexical retrieval a strong advantage.
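The all-words-included constraint can be checked mechanically after generation. The following is a minimal sketch of such a check, not the validation used for the experiments: the tokenizer (lowercasing and splitting on non-alphanumeric characters) and the function names are assumptions made for illustration.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase and split on non-alphanumeric characters.

    This simplistic tokenizer is an assumption; a real check should use the
    same analyzer as the lexical index.
    """
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def missing_query_words(query: str, document: str) -> set[str]:
    """Return the query words that do not occur in the document."""
    return tokenize(query) - tokenize(document)


def is_all_words_included(query: str, document: str) -> bool:
    """True when every query word also appears in the document."""
    return not missing_query_words(query, document)


document = (
    "The report traced the password reset issue "
    "to a mail delivery queue bottleneck."
)

# Every word occurs in the document, so the query satisfies the constraint.
print(is_all_words_included("password reset mail queue bottleneck", document))

# "slow" and "emails" do not occur in the document, so this query fails it.
print(missing_query_words("slow password reset emails", document))
```

A query that fails the check would be regenerated or moved to a different dataset, depending on how strict the benchmark needs to be.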
Before we could trust results from these datasets, we needed to validate our query dataset. For evaluation, we used mean reciprocal rank, or MRR. MRR measures how high the known source document appears in the ranked retrieval results. For each query, we look at the rank of the first relevant document and take its reciprocal: a document at rank 1 gets a score of 1.0, rank 2 gets 0.5, rank 10 gets 0.1, and a missing document gets 0. The final MRR score is the average of these reciprocal-rank scores across all queries.
In this setup, a higher MRR means the retrieval system is placing the generated query's source document closer to the top of the results. That makes MRR useful for comparing the three query types, because it captures not only whether the source document was found, but how early it appeared. Since users and agents mostly care about the top results, this is more informative than only checking whether the document appeared anywhere in the result set.
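The MRR definition above can be sketched in a few lines. This is an illustrative implementation under the document-grounded setup, where each query has exactly one known source document; the function names and the optional cutoff parameter `k` are assumptions, not the evaluation code used for the experiments.

```python
from typing import Optional, Sequence, Tuple


def reciprocal_rank(
    ranked_ids: Sequence[str], source_id: str, k: Optional[int] = None
) -> float:
    """Return 1/rank of the source document, or 0.0 if it is missing.

    When k is given, only the top-k results are considered (MRR@k).
    """
    top = ranked_ids if k is None else ranked_ids[:k]
    for position, doc_id in enumerate(top, start=1):
        if doc_id == source_id:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(
    runs: Sequence[Tuple[Sequence[str], str]], k: Optional[int] = None
) -> float:
    """Average the reciprocal ranks over (ranked_ids, source_id) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ids, src, k) for ids, src in runs) / len(runs)


runs = [
    (["a", "b", "c"], "a"),        # source at rank 1 -> 1.0
    (["x", "y", "z"], "y"),        # source at rank 2 -> 0.5
    (["p", "q"], "not-retrieved"), # source missing   -> 0.0
]
print(mean_reciprocal_rank(runs))  # (1.0 + 0.5 + 0.0) / 3 = 0.5
```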
Garbage in, garbage out
The first round of synthetic queries gave poor results. The problem was the source documents. Common Crawl contains many very short pages: error pages, login screens, cookie notices, fragments of navigation, and pages whose content is effectively just "Page Not Found." When those documents were sampled, the LLM generated queries from source material with very little signal.
Semantic retrieval had little meaningful context to target. Lexical retrieval had few distinctive terms to match. It was tempting to read this as a retrieval failure, but both retrieval modes performed poorly because the source documents had too little content to be distinctive.
To remove these low-signal queries, we required source documents used for synthetic query generation to contain at least 500 bytes of content.
After regenerating the synthetic query datasets with the content-length restriction, MRR improved. The important methodological point is that once the source documents contained enough content, the benchmark began measuring meaningful retrieval behavior.
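A filter of this kind is simple to express. The sketch below assumes that length is measured on the extracted text, encoded as UTF-8, after stripping surrounding whitespace; the text above only gives the 500-byte threshold, so those details and the function names are illustrative assumptions.

```python
MIN_CONTENT_BYTES = 500  # threshold stated in the text


def content_length_bytes(text: str) -> int:
    """Length of the text in UTF-8 bytes, ignoring surrounding whitespace.

    Measuring on stripped, UTF-8 encoded text is an assumption.
    """
    return len(text.strip().encode("utf-8"))


def filter_source_documents(
    documents: dict[str, str], min_bytes: int = MIN_CONTENT_BYTES
) -> dict[str, str]:
    """Keep only documents long enough to generate meaningful queries from."""
    return {
        doc_id: text
        for doc_id, text in documents.items()
        if content_length_bytes(text) >= min_bytes
    }


documents = {
    "error-page": "Page Not Found",
    "article": "A longer page with real content. " * 30,
}
print(list(filter_source_documents(documents)))  # only "article" remains
```

A byte threshold is a blunt instrument: it removes error pages and navigation fragments, but it does not guarantee that a longer page is informative.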
Put bluntly: garbage in, garbage out. A synthetic query benchmark only works if the source documents contain enough real information to generate meaningful queries. Otherwise, we are not testing retrieval quality; we are testing how well the system handles noise.
The best merge strategy changed with the query
Before merging, it helps to see how the two retrieval modes behaved on their own across the three query types. Lexical retrieval was strongest on all-words-included queries, where every query term appeared in the source document. Semantic retrieval was strongest on agentic queries, which used richer intent and allowed paraphrases. Human-style queries were harder for both modes: short, fragmentary queries gave neither signal enough evidence.
For the hybrid experiments, we retrieved the top 2,000 candidates from each retrieval mode, combined the two candidate lists, ranked the combined set with a merge strategy, and computed MRR@10 on the result.
We tested two merge strategies. Reciprocal Rank Fusion (RRF) combines results by rank position, which avoids unfair comparison between the raw scores coming from lexical and semantic retrieval. Linear Normalization (LN) maps scores from each retrieval mode onto a common 0 to 1 scale, then combines them with a weighting parameter called alpha. In our setup, alpha controlled the lexical weight: 1.0 meant pure lexical, and 0.0 meant pure semantic.
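Both strategies fit in a short sketch. Several details here are assumptions, because the text does not specify them: the RRF constant `k = 60` is the commonly used default, LN is implemented as min-max normalization per retrieval mode, a document missing from one candidate list contributes 0 for that mode, and ties are broken by document ID to keep the output deterministic.

```python
from typing import Dict, List


def rrf_merge(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Reciprocal Rank Fusion: sum 1 / (k + rank) over all input rankings."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Map raw scores onto [0, 1]; all-equal scores map to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def ln_merge(
    lexical: Dict[str, float], semantic: Dict[str, float], alpha: float
) -> List[str]:
    """Linear normalization merge; alpha is the lexical weight.

    alpha = 1.0 is pure lexical, alpha = 0.0 is pure semantic.
    """
    lex = min_max_normalize(lexical)
    sem = min_max_normalize(semantic)
    combined = {
        doc_id: alpha * lex.get(doc_id, 0.0) + (1.0 - alpha) * sem.get(doc_id, 0.0)
        for doc_id in set(lex) | set(sem)
    }
    return sorted(combined, key=lambda d: (-combined[d], d))


# Toy scores on different scales (for example BM25 versus cosine similarity).
lexical = {"a": 12.0, "b": 9.0, "c": 3.0}
semantic = {"b": 0.9, "c": 0.8, "d": 0.1}

print(rrf_merge([["a", "b", "c"], ["b", "c", "d"]]))  # ['b', 'c', 'a', 'd']
print(ln_merge(lexical, semantic, alpha=1.0))         # ['a', 'b', 'c', 'd']
print(ln_merge(lexical, semantic, alpha=0.0))         # ['b', 'c', 'a', 'd']
```

RRF needs only rank positions, so it has no weight to tune beyond `k`. LN keeps the score magnitudes, which is what makes `alpha` meaningful but also makes the result sensitive to how each mode's scores are distributed.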
For all-words-included queries, the best LN setting was lexical-heavy, but not pure lexical. The curve peaked at alpha 0.75, where lexical retrieval dominated but semantic retrieval still contributed.
If every query term appears in the source document, why does a semantic contribution help? Exact term overlap is strong, but incomplete. Lexical scoring can still produce ties or near ties when multiple documents contain similar words. A small semantic contribution can help separate documents that are lexically similar but not equally relevant.
For agentic queries, the pattern reversed. The best LN setting was semantic-heavy. These queries used richer intent and could rely on paraphrase, so semantic retrieval had the advantage. Pure semantic was not the only useful signal, though. Some lexical anchoring still helped because agentic queries often include keywords.
For human-style queries, RRF was the strongest and most stable choice, although it scored significantly lower here than on the all-words-included and agentic queries. The LN curve came closest when lexical and semantic weights were more balanced, but the differences were smaller than in the other query regimes. This fits the shape of the workload: human-style queries are short and ambiguous enough that it is harder to know in advance which signal should dominate.
So the answer to "what is the best hybrid strategy?" is: for which queries?
RRF was not the single best strategy for every query type, but for our corpus it was a good default across all the query types we tested. LN performed best when the query style was known and the weight could be tuned to match it.
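Tuning the LN weight for a known query style amounts to sweeping alpha over a labeled query set and comparing MRR@10. The sketch below shows the shape of such a sweep under stated assumptions: min-max normalization, an alpha grid in steps of 0.25, a missing score counted as 0, and toy data standing in for real retrieval output. None of these come from the experiments themselves.

```python
from typing import Dict, List, Sequence, Tuple

Scores = Dict[str, float]
# One labeled query: (lexical scores, semantic scores, source document ID).
LabeledQuery = Tuple[Scores, Scores, str]


def _normalize(scores: Scores) -> Scores:
    """Min-max normalize onto [0, 1]; all-equal scores map to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}


def ln_rank(lexical: Scores, semantic: Scores, alpha: float) -> List[str]:
    """Rank documents by alpha * lexical + (1 - alpha) * semantic."""
    lex, sem = _normalize(lexical), _normalize(semantic)
    combined = {
        d: alpha * lex.get(d, 0.0) + (1.0 - alpha) * sem.get(d, 0.0)
        for d in set(lex) | set(sem)
    }
    return sorted(combined, key=lambda d: (-combined[d], d))


def mrr_at_k(queries: Sequence[LabeledQuery], alpha: float, k: int = 10) -> float:
    """MRR@k of the LN merge over a set of labeled queries."""
    total = 0.0
    for lexical, semantic, source_id in queries:
        top = ln_rank(lexical, semantic, alpha)[:k]
        if source_id in top:
            total += 1.0 / (top.index(source_id) + 1)
    return total / len(queries) if queries else 0.0


def sweep_alpha(
    queries: Sequence[LabeledQuery],
    alphas: Sequence[float] = (0.0, 0.25, 0.5, 0.75, 1.0),
    k: int = 10,
) -> Dict[float, float]:
    """Return MRR@k for each candidate lexical weight."""
    return {alpha: mrr_at_k(queries, alpha, k) for alpha in alphas}


# Toy query where lexical retrieval ranks the source document first
# and semantic retrieval ranks it second.
toy_queries = [
    (
        {"src": 5.0, "x": 4.0, "y": 1.0},
        {"x": 0.9, "src": 0.5, "y": 0.1},
        "src",
    )
]
for alpha, score in sweep_alpha(toy_queries).items():
    print(f"alpha={alpha:.2f}  MRR@10={score:.3f}")
```

Running the sweep separately per query dataset is what exposes the regime dependence: the alpha that maximizes MRR on one query style need not be the best on another.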
There is no default merge strategy
It is tempting to treat hybrid search as something you can tune once: pick a merge algorithm, choose a lexical/semantic weight, and ship it.
But there is no globally correct merge strategy. The best choice depends on your corpus and on the queries your system actually receives. Agentic queries and human-written queries can behave very differently, and they may need different merging strategies.
So the practical lesson is simple: test on your corpus, with your query mix, before committing to a merge strategy.
Retrieval works as a system, not a score
Across the three posts, the pattern is consistent. In Part 1, the operational cost of learning dominated the work. A full reindex took long enough that every configuration mistake had a large cost. In Part 2, parameter assumptions interacted with embedding behavior and graph quality. In Part 3, the query type changed which retrieval signal mattered most.
That is where we would leave someone starting a similar project: retrieval evaluation is not one benchmark table at the end. Check that the corpus contains what you think it contains. Check that documents are fed correctly. Check the length distribution of retrieved documents. Check your assumptions.
There are still open questions we want to explore. How do agents actually formulate retrieval queries differently from humans? What changes when retrieval moves from document-level to chunk-level? When does diversity become a ranking concern? How should filtering, quantized embeddings, ANN, lexical retrieval, and reranking be combined into one efficient pipeline?
This series started with a question that sounded simple: what does it take to build and evaluate retrieval at 100 million documents? The answer turned out to be a machine with several interlocking gears. The index is one gear. The embeddings and their interaction with ANN parameters are another. The merge strategy between lexical and semantic retrieval is a third. Holding these gears in alignment is the evaluation pipeline, which only produces trustworthy numbers when the corpus, the queries, and the relevance assumptions are each examined on their own terms.
You can't build something and expect it to be right on the first try; the system needs validation. The work is in building the feedback loop that tells you whether the configuration is right for you.
Part 1 covered the operational reality of building and iterating at this scale.
Part 2 covered ANN tuning: how index configuration, embedding behavior, and graph connectivity interact at 100 million documents.
We're building Hornet for teams working on this problem.