---
title: "Agent queries cost more to serve"
description: "Agent queries are longer and carry more operators than human queries. On the same engine and index, that distribution shift cuts serving capacity 8.4x, and agents are now the workload to optimize for."
excerpt: "A human types two words; an agent sends ten. That shift in query mix means 8.4x the cost to serve."
date: "2026-06-03"
authors: ["bergum"]
tags: ["Agentic retrieval", "Infrastructure"]
category: "Proof"
slug: "agentic-query-workloads-change-retrieval-cost"
heroImage: "hero.webp"
---

I've worked on large-scale production retrieval systems for decades, and some of the strangest load-related outages I have seen were caused by changes in query mix. Most people think serving capacity only runs out because query volume spikes. Just as often, the query mix changes and the system overloads at a traffic level that used to be safe.

For example, a frontend starts adding filters to every query. Or, with no backend change at all, a customer starts sending longer queries, or phrase matching becomes more common. A [phrase query](https://nlp.stanford.edu/IR-book/html/htmledition/positional-postings-and-phrase-queries-1.html) made of common words ("to be or not to be") costs more CPU to evaluate than a simple conjunction (AND) of two rare words.
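
A minimal sketch of why the phrase query costs more, assuming a toy in-memory positional index. The `INDEX` contents and the function names are made up for illustration, and a real engine works on compressed posting lists rather than Python dicts. The conjunction only intersects document IDs. The phrase match starts from that same intersection and then also has to check positions inside every candidate document.

```python
from bisect import bisect_left

# Toy positional index: term -> {doc_id: sorted term positions}. Data is invented.
INDEX = {
    "to": {1: [0, 4], 2: [3]},
    "be": {1: [1, 5], 2: [0]},
    "or": {1: [2]},
    "not": {1: [3]},
}

def _has(positions, p):
    """Binary search for position p in a sorted position list."""
    i = bisect_left(positions, p)
    return i < len(positions) and positions[i] == p

def conjunction(terms):
    """Doc IDs that contain every term: an intersection over doc IDs only."""
    docs = set(INDEX[terms[0]])
    for term in terms[1:]:
        docs &= set(INDEX[term])
    return sorted(docs)

def phrase(terms):
    """Doc IDs where the terms occur at consecutive positions."""
    hits = []
    for doc in conjunction(terms):            # a phrase match is first a conjunction match
        for start in INDEX[terms[0]][doc]:    # then each candidate start position is verified
            if all(_has(INDEX[t][doc], start + i) for i, t in enumerate(terms)):
                hits.append(doc)
                break
    return hits

print(conjunction(["to", "be"]))             # [1, 2]
print(phrase(["to", "be"]))                  # [1]
print(phrase("to be or not to be".split()))  # [1]
```

With common words, the candidate set and the position lists are both long, so the extra position checks are where the CPU time goes.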

Everyone who has run retrieval infrastructure in production at scale knows that capacity planning is not just query volume planning; it is query mix planning too. This is also why I shrug when I hear people say *"we can do x thousands of QPS on this hardware"*. The number says little about the capacity of the system unless you know the query mix behind it. Was it measured with a realistic mix that is representative of the workload you expect?

## The query distribution changes with agents

Agents query differently. Most infrastructure discussions about agents as the new user start with volume.
A human might search a few dozen times in a session, limited by typing and reading speed. An agent can [search hundreds or thousands of times](https://arxiv.org/abs/2601.17617) while it tests hypotheses, gathers evidence, and verifies its answer. That volume increase matters. But even with unchanged query volume, the query workload changes.

I went deep on what the agentic query workload looks like in [this is what agentic retrieval looks like](/blog/this-is-what-agentic-retrieval-looks-like).

![Side-by-side anatomy of two real queries. A human query from the AOL search log, "home depot", is two terms. A GPT-5 BrowseComp agent query, site:snooker.org 2021 UK Championship final Zhao Xintong Brecel "Frame scores", is ten terms and carries a site: filter, a year, and a quoted phrase match. The agent query is 5x longer than the human median and adds phrase and filter operators that each cost extra query-processing work.](./anatomy-of-a-query.webp)
*A typical human keyword query (AOL search log) next to a GPT-5 BrowseComp agent query. Both are real; the operators an agent reaches for are labeled.*

It is not just length. The agent query carries a phrase match, a year, and a `site:` filter, and each operator adds query-processing work on top of the extra terms.
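
To make "query shape" concrete, here is a small sketch that counts terms and operators in a web-style query string. The tokenizer is a toy and the counting convention is my assumption: a `field:value` filter counts as one term and each word inside a quoted phrase counts as one term, which reproduces the ten-term count in the figure above. The function name `query_shape` is hypothetical.

```python
import re

def query_shape(query: str) -> dict:
    """Count terms and operators in a web-style query string (toy tokenizer)."""
    phrases = re.findall(r'"([^"]+)"', query)            # quoted phrase matches
    rest = re.sub(r'"[^"]+"', " ", query).split()        # everything outside quotes
    filters = [t for t in rest if re.match(r"^\w+:\S+$", t)]  # e.g. site:example.org
    words = [t for t in rest if t not in filters]
    phrase_words = [w for p in phrases for w in p.split()]
    return {
        "terms": len(filters) + len(words) + len(phrase_words),
        "phrases": len(phrases),
        "filters": len(filters),
        "years": sum(1 for w in words if re.fullmatch(r"(19|20)\d{2}", w)),
    }

human = "home depot"
agent = 'site:snooker.org 2021 UK Championship final Zhao Xintong Brecel "Frame scores"'
print(query_shape(human))  # {'terms': 2, 'phrases': 0, 'filters': 0, 'years': 0}
print(query_shape(agent))  # {'terms': 10, 'phrases': 1, 'filters': 1, 'years': 1}
```

Running something like this over a query log gives you the length and operator distribution to benchmark against, instead of a single average.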

![Horizontal bar chart comparing query length percentiles between the AOL search log (2006, 36 million human queries) and the GPT-5 + BM25 BrowseComp-Plus trace (19,279 agent queries). Human queries: median 2 terms, p90 5, p99 8. Agent queries: median 10, p90 17, first-query mean 19. The agent median sits past the human 99th percentile.](./human-vs-agent-query-length.webp)
*Query length percentiles, AOL search log (human) vs the GPT-5 + BM25 BrowseComp-Plus trace.*

## Same engine and same data, different serving capacity

If you want to see how much serving capacity can swing, keep the engine and data fixed and change only the query mix. We indexed 100M English Common Crawl WET documents in Hornet and ran three query workloads against the same hardware, the same index, and the same `BM25(body)` top-10 retrieval, changing only the queries. Each test ran on a single AWS Graviton4 instance with 32 vCPUs and 128 GiB of memory. Hornet served the 100M-document index entirely from memory in about 56 GiB.

![Horizontal bar chart comparing QPS at the last measured point at or below 500 ms p99 for Hornet over 100M Common Crawl documents across human and agent query workloads: AOL human keyword queries at 3,236 QPS, MS MARCO human questions at 1,151 QPS, and BrowseComp GPT-5 agent queries at 384 QPS. The same index sees an 8.4x serving capacity swing.](./query-workload-qps-swing.webp)
*Last measured sub-500 ms p99 QPS on the same Hornet system and 100M-doc index, changing only the query workload.*

That is an 8.4x spread in serving capacity on the same system. The agent workload is the expensive end of the chart, because every extra term adds another posting list to traverse and score.
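
The arithmetic behind that spread, as a short sketch. The QPS values are the ones from the chart; reading the ratio as relative per-query cost assumes the node is saturated in all three runs, so cost per query is inversely proportional to QPS.

```python
# QPS at the last measured point at or below 500 ms p99, from the chart above.
qps = {
    "AOL human keywords": 3236,
    "MS MARCO human questions": 1151,
    "BrowseComp GPT-5 agent": 384,
}

baseline = qps["AOL human keywords"]
# Assumption: same saturated node, so relative cost per query = baseline QPS / workload QPS.
relative_cost = {name: baseline / value for name, value in qps.items()}

for name, cost in relative_cost.items():
    print(f"{name}: {qps[name]} QPS, {cost:.1f}x the per-query cost of the AOL workload")
```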

Some of the mechanics of query length and query processing over inverted indexes are covered in [the scaling dimensions of keyword search](/blog/the-scaling-dimensions-of-keyword-search), so I won't cover the fundamentals of posting lists, BM25 scoring, and [various dynamic top-k pruning techniques](https://www.elastic.co/blog/faster-retrieval-of-top-hits-in-elasticsearch-with-block-max-wand) here.

## Agents are a new user, and their workload costs more to serve

Now that agents are showing up as a serious share of retrieval traffic, we have a new workload to plan for. They don't just raise how many queries you serve; they raise what each query costs to serve.

I don't know how many times I have said this in my career: benchmark your system with a realistic query mix. Otherwise you are in for a surprise once you get real users. [Database teams learn the same lesson the hard way](https://www.percona.com/blog/the-importance-of-realistic-benchmark-workloads/), and it applies just as much to search. A QPS number that does not say what kind of queries produced it is missing the most important part of the story, and with agents that story is as much about cost to serve as raw capacity.
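
A back-of-the-envelope sketch of what a mixed workload does to capacity. It assumes a simple linear cost model that I have not measured: each query consumes 1/QPS of the node, and workloads do not interfere with each other. Under that assumption, blended capacity is a weighted harmonic mean of the per-workload numbers. The function name and the 80/20 mix are hypothetical.

```python
def blended_qps(mix: dict, qps: dict) -> float:
    """Capacity for a traffic mix, assuming each query uses 1/QPS of the node."""
    assert abs(sum(mix.values()) - 1.0) < 1e-9, "traffic shares must sum to 1"
    return 1.0 / sum(share / qps[name] for name, share in mix.items())

# Per-workload capacity from the benchmark in this post.
measured = {"human": 3236, "agent": 384}

# Hypothetical mix: 20% of traffic comes from agents.
print(round(blended_qps({"human": 0.8, "agent": 0.2}, measured)))
```

Under this model, a 20% agent share pulls capacity from 3,236 QPS down to roughly 1,300 QPS. The expensive queries dominate long before they are the majority of traffic.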

## Hornet is optimized for the new workload

A retrieval engine for agents has to use query shape when it plans query execution, and that is the workload Hornet is built for. On the BrowseComp GPT-5 agent-query workload, Hornet sustains roughly 1.6x the throughput of the closest comparison engine at the same 500 ms p99 latency target. It holds p99 under these long queries with a new dynamic query pruning algorithm that evaluates the query in parallel and keeps memory access cache-friendly under load.

That comes down to how the index postings are laid out and how the engine processes queries over them. Posting lists are stored in compact, fixed-size blocks with the metadata to skip whole blocks that cannot make the top-k, and those blocks decode with SIMD so the engine decompresses postings in bulk instead of one document at a time. How it processes those postings is chosen by query shape: a document-at-a-time (DAAT) traversal with dynamic pruning when a few rare terms carry the score, and a block-at-a-time (BAAT) accumulation when a long query spreads weight across many common terms. Long agent queries land in that second case far more often than a two-word human query ever did.
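
A minimal sketch of the block-skipping idea, not Hornet's implementation. It assumes a single posting list split into blocks, each carrying the maximum score of any posting in the block, and precomputed per-document scores. Multi-term queries, compression, and SIMD decoding are left out, and the function name and data are made up.

```python
import heapq

def top_k_with_block_skipping(blocks, k):
    """blocks: list of (block_max_score, [(doc_id, score), ...]).

    Returns the top-k (score, doc_id) pairs and the number of skipped blocks.
    """
    heap = []      # min-heap holding the current top-k as (score, doc_id)
    skipped = 0
    for block_max, postings in blocks:
        # If even the best posting in this block cannot beat the current
        # k-th score, skip the block without looking at its postings.
        if len(heap) == k and block_max <= heap[0][0]:
            skipped += 1
            continue
        for doc_id, score in postings:
            if len(heap) < k:
                heapq.heappush(heap, (score, doc_id))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True), skipped

blocks = [
    (3.0, [(1, 3.0), (2, 1.0)]),
    (0.5, [(3, 0.5), (4, 0.2)]),   # cannot beat the current top-2, so it is skipped
    (5.0, [(5, 5.0), (6, 0.1)]),
]
print(top_k_with_block_skipping(blocks, k=2))  # ([(5.0, 5), (3.0, 1)], 1)
```

Skipping pays off when a few terms dominate the score, because the threshold rises quickly. When a long query spreads weight across many common terms, far fewer blocks can be ruled out, which is why the traversal strategy has to change with query shape.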

Every engine ran on the same single Graviton4 instance, served the index from memory, and was configured to its own documented best practices, so the comparison reflects each engine at its best rather than out of the box.

![Line chart comparing p99 latency versus QPS for anonymized engines on BrowseComp GPT-5 agent queries over 100M documents. Hornet stays below 500 ms p99 until 384 QPS. Engine B and engine A cross the 500 ms p99 line around 230 to 233 QPS, while engine C starts above 1 second p99.](./agent-query-engine-p99-spike.webp)
*p99 latency versus throughput on BrowseComp GPT-5 agent queries. Comparison engines are anonymized.*

How to read this chart: the x-axis is the observed throughput (QPS) as the number of benchmarking clients increases, and the y-axis is the observed p99 latency. Each line represents one engine. As load increases, the observed QPS goes up (left to right). Latency stays flat while the engine has headroom, then bends sharply upward once the engine hits a bottleneck and requests start queuing. The point where a line crosses 500 ms p99 is the most traffic that engine can serve within the latency target.
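
The same reading rule as a small sketch: given measured (QPS, p99) points for one engine, capacity is the highest measured QPS at or below the latency target. The function name is hypothetical and the sample points are invented to show the shape of a curve; they are not measurements from this benchmark.

```python
from typing import List, Optional, Tuple

def capacity_at_target(points: List[Tuple[float, float]],
                       p99_target_ms: float = 500.0) -> Optional[float]:
    """Highest measured QPS whose p99 latency is at or below the target."""
    within_target = [qps for qps, p99 in points if p99 <= p99_target_ms]
    return max(within_target) if within_target else None

# Invented (qps, p99_ms) points: flat while there is headroom, then a sharp bend.
curve = [(100, 80), (200, 95), (300, 140), (350, 420), (380, 900)]
print(capacity_at_target(curve))          # 350

# An engine that is already above the target at its first point has no capacity at it.
print(capacity_at_target([(50, 1200)]))   # None
```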

**At the same p99 latency target, Hornet serves 384 QPS where the closest alternatives reach about 230 QPS, roughly 1.6x the throughput per node. To absorb the same agentic query volume, you would need to provision about 1.6x the machines with those engines.**
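
A quick provisioning sketch for that claim. It assumes you scale out by adding identical replicas, that throughput scales linearly with node count, and that you run each node right at the 500 ms p99 limit with no extra headroom. The 10,000 QPS target is a hypothetical number, not a measured one.

```python
import math

def nodes_needed(target_qps: float, per_node_qps: float) -> int:
    """Replicas required to serve target_qps, assuming linear scale-out."""
    return math.ceil(target_qps / per_node_qps)

target = 10_000  # hypothetical agent query volume

print(nodes_needed(target, 384))  # 27 nodes at 384 QPS per node
print(nodes_needed(target, 230))  # 44 nodes at about 230 QPS per node
```

That is 44 machines instead of 27 for the same traffic, about 1.6x the fleet.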

A two-word query typed by a human was the primary workload that dynamic top-k pruning algorithms were optimized for. Agents are the new user of retrieval infrastructure, and Hornet is built to serve that new workload at a latency and cost that let you scale agents.

*We're building Hornet for teams working on agentic retrieval. To be notified about new posts, benchmarks, and early product notes, {% signup-link %}join our user community list{% /signup-link %}.*