---
title: "The context window is not your database"
description: "Context windows keep growing, but size alone degrades reasoning. We explain why treating the context window as a database breaks under agentic workloads, and why retrieval still matters."
excerpt: "Context windows can hold more than ever, but more context doesn't mean better answers. Here's why retrieval still matters, especially for agents."
date: "2026-02-28"
authors: ["everling"]
tags: ["Hornet.dev"]
category: "Thinking traces"
heroImage: "hero.webp"
references:
  - url: "https://arxiv.org/abs/2502.09977"
    title: "In Defense of RAG in the Era of Long-Context Language Models"
  - url: "https://arxiv.org/abs/2501.01880"
    title: "RAG vs. Long Context: Examining Frontier Large Language Models for Environmental Review Document Comprehension"
  - url: "https://arxiv.org/abs/2407.16833"
    title: "Is the Long Context of LLMs Truly Effective? A Mining Study of RAG vs. LC"
  - url: "https://hornet.dev/blog/mutually-assured-distraction"
    title: "Mutually assured distraction"
    description: "How plausible but incorrect evidence gets promoted into model reasoning."
  - url: "https://arxiv.org/abs/2307.03172"
    title: "Lost in the Middle: How Language Models Use Long Contexts"
  - url: "https://www.langchain.com/state-of-agent-engineering"
    title: "State of Agent Engineering — LangChain 2026 Survey"
  - url: "https://developers.googleblog.com/en/architecting-efficient-context-aware-multi-agent-framework-for-production/"
    title: "Architecting Efficient Context-Aware Multi-Agent Framework for Production"
  - url: "https://martinfowler.com/articles/exploring-gen-ai/context-engineering-coding-agents.html"
    title: "Context Engineering for Coding Agents"
    description: "Why giving agents more context often makes things worse."
---

Someone on your team has probably said: "Why don't we just put everything in the context window?" Context window limits have grown from 4K tokens to over 1M. If the documents fit, why bother with a retrieval pipeline?

We build retrieval infrastructure for agents at Hornet, so we hear this a lot. And for plenty of use cases, long context is the right call. If your knowledge base is small and your query volume is low, stuffing it all in the window is simpler and often more accurate than chunked retrieval {% citation ref=1 /%}{% citation ref=2 /%}.

The problems start when teams treat "it fits in the context window" as a retrieval strategy. Length alone degrades reasoning, even when the model can find the right evidence. Real documents full of near-duplicate content make it worse. Agent loops multiply both problems across every turn. The three effects compound quickly.

## Two floppy discs

Our CEO Jo Kristian likes to put it this way: a million tokens is about three megabytes of text. That's two floppy discs from the 80s. If your data fits on two floppies, you probably don't need retrieval. Stuff it in the context, ship the feature, move on.

The catch is what happens when agents enter the picture. Agents don't make one call. They run in turns, and every turn carries the full context forward. A ten-turn session over a million-token context isn't one million tokens of input; it's ten million. At $6 per million input tokens (Sonnet 4.6 long-context pricing), that's $60 in input tokens for a single agent run. A study by Xu et al. measured RAG costing roughly 1,250x less per query than equivalent long-context approaches {% citation ref=3 /%}. Once agents are looping, that per-query gap is paid again on every turn.
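
The arithmetic is easy to check. Here is a minimal sketch using the figures above; the function name and the `appended_per_turn` parameter are our own illustration, not part of any pricing API, and the sketch counts input tokens only.

```python
def session_input_cost(
    context_tokens: int,
    turns: int,
    usd_per_million: float,
    appended_per_turn: int = 0,
) -> float:
    """Input-token cost of a session that resends its whole context every turn.

    appended_per_turn is a hypothetical knob for tool output added after each
    turn; with the default of 0 the context stays the same size throughout.
    """
    total_input_tokens = sum(
        context_tokens + turn * appended_per_turn for turn in range(turns)
    )
    return total_input_tokens / 1_000_000 * usd_per_million


# Figures from the paragraph above: 1M-token context, ten turns, $6 per million.
cost = session_input_cost(context_tokens=1_000_000, turns=10, usd_per_million=6.0)
print(f"${cost:.2f}")  # $60.00
```

Output tokens and prompt caching are left out here, and both would change the real bill. The shape of the curve is the point: cost scales with turns times context size, and it grows faster still if each turn appends tool output.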

![](llm-context-window-size.webp)

## More context, more mistakes

The economics matter, but the accuracy problem is what keeps us up at night. My colleague Lester Solbakken coined the phrase "mutually assured distraction" {% citation ref=4 /%} for the failure mode we kept seeing: plausible but incorrect evidence doesn't just sit in the window doing nothing. It gets promoted into the model's reasoning. Think about finance or legal documents, templated text where a single date or dollar amount is the only difference between two paragraphs. The model picks the wrong one, builds a confident answer on top of it, and nothing in the output tells you what happened.

Frontier models have mostly solved the positional recall problem (the old "lost in the middle" effect {% citation ref=5 /%}). But distractors are different from needles, and the problem compounds in agent loops. Every tool call appends its output to the context. Ten iterations into a research task, the window is packed with intermediate results, abandoned hypotheses, and superseded information. The stuff the agent actually needs for its next decision might be 2% of what's in the window.

We're not the only ones seeing this. LangChain's 2026 survey of 1,300+ practitioners put context management near the top of the pain list for teams running agents in production {% citation ref=6 /%}. Google's ADK codified "scope by default," where every model call sees only the minimum context it needs {% citation ref=7 /%}. Fowler's team at Thoughtworks wrote about how the natural instinct when agents misbehave, giving them more context, actually makes things worse {% citation ref=8 /%}. Everyone arrived at the same place we did: don't put everything in the window.
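
To make "minimum context" concrete, here is one way to picture it. This is a sketch of the idea only, not ADK's API: the `Event` type, the `key` field, and `scoped_context` are all hypothetical names. Instead of replaying the whole transcript, each call gets the task plus the latest result for the things it actually needs.

```python
from dataclasses import dataclass


@dataclass
class Event:
    step: int
    key: str      # what the result is about, e.g. "q3_revenue" (hypothetical)
    content: str


def scoped_context(task: str, history: list[Event], needed_keys: set[str]) -> str:
    """Build the prompt for one call from the latest event per needed key."""
    latest: dict[str, Event] = {}
    for event in history:  # history is assumed to be in step order
        if event.key in needed_keys:
            latest[event.key] = event  # a later event supersedes an earlier one
    lines = [f"Task: {task}"]
    lines += [f"[{e.key} @ step {e.step}] {e.content}" for e in latest.values()]
    return "\n".join(lines)


history = [
    Event(1, "q3_revenue", "Draft figure: 4.1M"),
    Event(2, "competitor_news", "Unrelated press coverage"),
    Event(3, "q3_revenue", "Audited figure: 4.3M"),
]
prompt = scoped_context("Summarize Q3 revenue", history, {"q3_revenue"})
print(prompt)
```

The superseded draft figure and the unrelated tool output never reach the model, so there is nothing for it to be distracted by. Deciding which keys a call needs is the hard part, and that is where retrieval comes in.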

## Why we keep building retrieval

Here's the pattern we see: A team starts with "put everything in context" because it's fast and it works in demos. They go to production and hit costs that scale per query, latency that grows with context size, and accuracy problems they can't consistently reproduce because the failures are statistical. They realize they need to add retrieval.

A retrieval layer narrows millions of candidates down to the handful that matter for a given query. In return you get stable costs, access controls, provenance, and failures you can actually debug. Context windows will keep growing, and models will keep getting better at using them. But getting the right information into the context, and only the right information, is the hardest part of building reliable agents. That's retrieval. And it's worth building properly.
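
As a toy illustration of that narrowing step, the sketch below ranks documents by term overlap with the query and keeps the top `k`. The scorer is a deliberately crude stand-in for a real ranking function, and every name and document in it is made up for the example; it is not how Hornet works.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query: str, doc: str) -> int:
    """Toy relevance: count occurrences of distinct query terms in the document."""
    doc_terms = Counter(tokenize(doc))
    return sum(doc_terms[term] for term in set(tokenize(query)))


def retrieve(query: str, docs: dict[str, str], k: int = 2) -> list[str]:
    """Return ids of the k best-scoring documents, skipping zero-score ones."""
    ranked = sorted(docs, key=lambda doc_id: score(query, docs[doc_id]), reverse=True)
    return [doc_id for doc_id in ranked[:k] if score(query, docs[doc_id]) > 0]


docs = {
    "lease-2023": "Lease agreement effective 2023-01-01, monthly rent 4000 USD.",
    "lease-2024": "Lease agreement effective 2024-01-01, monthly rent 4500 USD.",
    "handbook": "Employee handbook: vacation policy and expense reporting.",
}
context_ids = retrieve("monthly rent in the 2024 lease", docs, k=1)
print(context_ids)  # ['lease-2024']
```

Only the selected document goes into the prompt, and because the ids travel with the text, a wrong answer can be traced back to the passage that caused it.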

*We're building Hornet for teams working on this problem. To be notified about new posts, benchmarks, and early product notes, {% signup-link %}join our user community list{% /signup-link %}.*