
This is what agentic retrieval looks like
GPT-5 searches like a power user. Phrase quotes in 98% of sessions. Median query length past the human 99th percentile (AOL query log). The new user of search.

GPT-5 searches like a power user. Phrase quotes in 98% of sessions. Median query length past the human 99th percentile (AOL query log). The new user of search.

Hybrid search does not have one best weight. Agentic, all-words-included, and human-style queries each stress retrieval differently, which means evaluation has to be stratified.

We expected ANN tuning to be straightforward. It was not. Embedding bias, graph connectivity, and quantization loss interacted in ways that only surfaced at 100 million documents.

Give the agent the relevant documents and it answers 93% of questions correctly. Make it find those documents with a weak retriever and it scores 14%. In BrowseComp-Plus, that gap makes retrieval impossible to ignore.

We built a retrieval system over 100 million web documents. The biggest lesson was not about any single configuration. When every experiment takes 30 hours, the speed of your reindex-evaluate loop determines how good your system can get.

Context windows can hold more than ever, but more context doesn't mean better answers. Here's why retrieval still matters, especially for agents.