---
title: "Code mode for agentic retrieval"
description: "Agents can write retrieval code. The hard part is exposing the primitives that decide what comes back, how it is scored, and what survives into context."
excerpt: "Frontier models are strong at code, and code lets agents manage retrieval without filling context. The next step is programmable retrieval primitives, not just programmable query loops."
date: "2026-06-18"
authors: ["everling", "bergum"]
tags: ["Agentic retrieval"]
category: "Thinking traces"
slug: "code-mode-for-agentic-retrieval"
heroImage: "hero.webp"
---

Search interfaces for humans hid retrieval behind one input box: type a few words, let the backend run a bespoke retrieval and ranking pipeline, and get back a ranked list of links. That contract made sense for human users. It makes less sense for agents.

Advanced search exposed more controls: Boolean operators, field filters, date constraints, exact phrases, source restrictions, ranking controls. Those controls mattered to power users, but almost nobody wanted to operate the advanced form directly.

## The new user writes code

Agents change that constraint. Frontier models are good at writing code: they can plan across many retrieval calls, keep intermediate state outside their context window, and apply logic before deciding what evidence to keep for the next turn. For agents, advanced mode is code, not an advanced search form.

Code also [compresses context](/blog/the-context-window-is-not-your-database): a program can fan out queries, filter results, and return compact evidence instead of dragging every intermediate tool call result back into the model context.

That shifts the abstraction from `search(query: str)` to `search(code: str)`: code uses a harness-specific SDK to assemble the retrieval loop.

We made this feedback-loop argument in [How we build a retrieval engine for agents](/blog/how-we-build-a-retrieval-engine-for-agents): agents can configure, test, correct, and optimize retrieval when it looks like code. Anthropic showed the context-management side one layer up in [Code execution with MCP](https://www.anthropic.com/engineering/code-execution-with-mcp), reporting a workflow that fell from 150,000 to 2,000 tokens once tool calls became code in a sandbox. Cloudflare shipped the same pattern as ["Code Mode"](https://blog.cloudflare.com/code-mode/), and [Perplexity's Search as Code](https://research.perplexity.ai/articles/rethinking-search-as-code-generation) applies it to web search.

## Code mode has to reach the primitives

Once `search(code)` is the interface, the obvious work is the harness: the sandbox, the SDK, and the instructions that teach a model to use them. That layer gives the agent a retrieval language and a way to manage context without dragging every tool call result into the model context, where it compounds with each agentic turn. That only works if the retrieval primitives work well.

![Diagram showing code mode for retrieval: the agent writes a retrieval loop that keeps bulky state outside context, programs retrieval primitives such as impact weights, freshness, scoring, filters, and fields, then returns only structured evidence to the model context.](./harness-and-engine.webp)

An agent can fan out twenty queries, dedupe, filter by date, and keep the survivors. If retrieval recall is low, no harness engineering recovers the documents the primitives never surfaced. The retrieval primitives decide what the code loop can inspect. [BrowseComp-Plus](https://arxiv.org/abs/2508.06600) gives a clean example: GPT-4.1 scored 93.49% when prompted with all labeled positive documents and 14.58% when the same model had to use BM25 retrieval, a 78.91-point gap with the model and corpus held fixed.

![BrowseComp-Plus accuracy comparison showing retrieval quality as the bottleneck: GPT-4.1 scores 93.49% when prompted with all labeled positive documents and 14.58% with BM25 retrieval over the same corpus, a 78.91-point retrieval-condition gap.](./retrieval-bottleneck.webp)

*Source: [BrowseComp-Plus](https://arxiv.org/abs/2508.06600), Table 1 (`gpt-4.1 + BM25`) and Section 4.8.1 (oracle retrieval).*

That is why code mode for retrieval cannot stop at the query loop. Agents should be able to program retrieval behavior itself: [impact weights](/blog/100m-doc-search-part-3-hybrid-search), freshness functions, scoring functions, and other controls that determine what gets returned.

## Retrieval has to return structure, not chunks

Code execution lets the agent inspect intermediate results before spending context tokens on them. Anyone who has watched a coding agent preserve context with shell tools has seen the pattern: keep bulky intermediate state outside the model context, operate on it with code, and bring back only what matters.

Result processing needs that shape: inspect fields, page through candidates, join lists, apply predicates, and write only the surviving evidence into context.

The useful primitive is a structured result set the agent can test, filter, and use to decide what to try next, rather than a top-k list of pre-split chunks.

## Run the loop close to your data with Hornet

Code mode for retrieval has to run close to the corpus: the public web, enterprise PDFs, source code, internal knowledge bases, or documents of any kind. Web-scale search and private retrieval both need relevance primitives an agent can inspect and control: freshness, scoring, weighting, filters, and structured results.

[Relevance is still the horizontal challenge](/blog/the-case-for-a-new-retrieval-engine-for-agents), but the new user changes the target: [longer queries, structured retrieval calls, and many more operations inside one task](/blog/this-is-what-agentic-retrieval-looks-like). [Plausible distractors](/blog/mutually-assured-distraction) can become premises, pollute later turns, and make the loop confidently wrong. Retrieval has to become something agents can [configure, test, and correct](/blog/how-we-build-a-retrieval-engine-for-agents).

The agent can write the retrieval code. Hornet makes the code useful by exposing the primitives that decide what comes back, how it is scored, and what survives into context.

*We're building Hornet for teams working on agentic retrieval. For new posts, benchmarks, and early product notes, {% signup-link %}join our user community list{% /signup-link %}.*
