The Skill That Actually Separates Production Agents from Demos
When an agent fails, most developers look for a better model. The problem is almost always the context.
The reflex is understandable: swap out the underlying LLM, adjust the temperature, retry. If that doesn't work, rewrite the prompt.
What they're usually missing: the problem isn't the model and it isn't the prompt. It's the context.
LangChain put it plainly in their post on what they're calling context engineering: when agentic systems fail, it's almost always because the model wasn't given what it needed to succeed. Not because it was incapable. The failure is upstream, in the system that decides what information gets handed to the model before it generates a response.
That shift in diagnosis changes everything about how you build.
What Context Engineering Actually Is
Prompt engineering is about how you phrase a request. Context engineering is about what surrounds that request: the full information environment the model sees at inference time.
Philipp Schmid at Google DeepMind defines it as "the discipline of designing and building dynamic systems that provide the right information and tools, in the right format, at the right time, to give an LLM everything it needs to accomplish a task." Tobi Lütke, Shopify's CEO, described it more bluntly: it's the art of providing all the context for the task to be plausibly solvable.
The distinction from prompt engineering matters practically. A prompt is a string. Context is a system.
That system assembles at runtime, dynamically, for each specific request: system instructions, the current user message, recent conversation history, long-term memories retrieved from persistent storage, results from tool calls, relevant documents pulled from external sources, and output format definitions. All of that lands in the model's context window before it generates a single token. The engineering challenge is deciding what goes in, what stays out, and in what order.
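The assembly step above can be sketched as a plain function. This is a minimal illustration, not any framework's API; the helper inputs (`memories`, `documents`, `output_schema`) are assumed to come from your own retrieval and storage layers:

```python
def assemble_context(system_prompt, history, user_message,
                     memories, documents, output_schema):
    """Build the message list the model sees, in a deliberate order."""
    parts = [{"role": "system", "content": system_prompt}]
    if memories:  # long-term facts retrieved from persistent storage
        parts.append({"role": "system",
                      "content": "Relevant long-term memories:\n" + "\n".join(memories)})
    if documents:  # relevant documents pulled from external sources
        parts.append({"role": "system",
                      "content": "Retrieved documents:\n\n" + "\n\n".join(documents)})
    parts.extend(history)  # recent conversation turns, verbatim
    parts.append({"role": "user", "content": user_message})
    parts.append({"role": "system",  # output format definition goes in last
                  "content": f"Respond as JSON matching this schema: {output_schema}"})
    return parts
```

The point is that every element is a deliberate decision made per request, not a string that grows by accident.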
Most developers treat this assembly as an afterthought. The developers building agents that actually work in production treat it as the core design problem.
The Science Behind Why This Matters
Models can't read minds, but they also can't reliably process everything you throw at them. This is the uncomfortable reality Chroma Research documented in a July 2025 study evaluating 18 state-of-the-art LLMs on retrieval and text replication tasks across varying context lengths. (research.trychroma.com/context-rot)
Performance degraded consistently as input token count increased. On some tasks, accuracy dropped from 70-75% with a lean context down to 55-60% when 20 retrieved documents were present. That's an agent that was working suddenly failing because you gave it more information.
The mechanism is architectural. The transformer attention pattern creates n-squared pairwise relationships across every token in the context. As context grows, that capacity gets stretched across more tokens, and the model's focus on any particular piece of information dilutes. Anthropic's engineering blog describes it as an "attention budget" — finite, depletable.
There are two consistent failure patterns. One is positional: models reliably attend more to information at the beginning and end of context, and miss what's buried in the middle — documented as the "lost-in-the-middle" problem. The other is less intuitive: coherent, relevant-sounding but wrong information is more disruptive than clearly irrelevant noise. Which means the documents you exclude might matter more than the ones you include.
The Practical Decisions
Given a finite attention budget, the goal isn't a bigger context. It's a better-curated one.
Start with compression, but don't flatten everything. The instinct when context gets long is to summarize the whole conversation. That's the wrong move. Recent turns carry high information density and should stay verbatim. Older turns can be compressed. A rolling verbatim window for recent history, extractive summary for older turns, semantic retrieval for long-term facts.
Where you put information is as important as what you include. If the model needs to follow a constraint, it goes at the top or bottom of context — not buried in the middle of retrieved documents. The lost-in-the-middle effect is consistent across model families.
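One simple way to act on the positional effect is to sandwich hard constraints around the bulk of the context. A minimal sketch, assuming plain-string context assembly (the function name and restatement wording are illustrative, not from any library):

```python
def sandwich_constraints(constraints, middle_chunks):
    """Place hard constraints at the start AND end of the context,
    where models attend most reliably (lost-in-the-middle effect)."""
    head = "Constraints:\n" + "\n".join(f"- {c}" for c in constraints)
    body = "\n\n".join(middle_chunks)  # retrieved docs, history, etc.
    tail = ("Before answering, re-check the constraints above:\n"
            + "\n".join(f"- {c}" for c in constraints))
    return f"{head}\n\n{body}\n\n{tail}"
```

Repeating a constraint costs a few dozen tokens; burying it in the middle risks it being ignored entirely.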
Noise isn't neutral — it's actively harmful. Irrelevant-but-coherent retrieved documents don't just waste tokens, they compete for attention with relevant content. A RAG pipeline retrieving 10 documents when 3 are actually useful is degrading output quality. Re-rank and prune before injecting.
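The prune step can be as simple as a top-k cut with a score floor after re-ranking. A sketch, assuming your retriever or re-ranker emits `(score, text)` pairs (the threshold values here are placeholders to tune per application):

```python
def prune_documents(scored_docs, top_k=3, min_score=0.5):
    """Keep only the strongest matches before injecting into context.

    scored_docs: list of (relevance_score, document_text) pairs,
    e.g. from a cross-encoder re-ranker.
    """
    ranked = sorted(scored_docs, key=lambda d: d[0], reverse=True)[:top_k]
    return [text for score, text in ranked if score >= min_score]
```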
Output format definitions belong in the context too. Defining a JSON schema or typed field structure reduces model ambiguity and makes downstream processing reliable.
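A typed output definition can be an ordinary JSON Schema dict passed to your provider's structured-output feature. The ticket-triage fields below are a hypothetical example, and the `matches_schema` helper is a deliberately minimal stand-in for a real JSON Schema validator:

```python
# Hypothetical schema for a support-ticket triage agent's output.
TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        "category": {"type": "string", "enum": ["billing", "bug", "feature"]},
        "priority": {"type": "integer", "minimum": 1, "maximum": 5},
        "summary":  {"type": "string"},
    },
    "required": ["category", "priority", "summary"],
    "additionalProperties": False,
}

def matches_schema(obj, schema=TICKET_SCHEMA):
    """Minimal check: required keys present, no extra keys.
    A production pipeline would use a full JSON Schema validator."""
    allowed = set(schema["properties"])
    return set(schema["required"]) <= set(obj) <= allowed
```

Putting the schema in context constrains generation; validating the parsed output catches the cases where the model drifts anyway.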
One Technique Worth Implementing This Week
Context compression with tiered memory is the highest-leverage change most teams can make to an existing agent.
Tier 1 is a verbatim rolling buffer of the last N turns — keep this around 4-6 turns for most applications.
Tier 2 is a compressed summary of older conversation history. When turns age out of Tier 1, they get summarized and stored.
Tier 3 is semantic retrieval from long-term storage: user preferences, facts the agent has been told to remember, decisions made in prior sessions. Only what's relevant to the current query gets retrieved.
Total tokens per request drops significantly. Accuracy on multi-turn tasks improves because the model sees clean, relevant context rather than an ever-growing transcript. This is the pattern Anthropic describes in their context engineering docs, and it's what memory tools like Mem0 implement out of the box.
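The three tiers above can be sketched in a few dozen lines. This is an illustrative skeleton, not Mem0's or anyone's actual implementation: `summarize` would be an LLM call and `retrieve` a vector search in practice, injected here as stand-in callables.

```python
class TieredMemory:
    """Tier 1: verbatim recent turns. Tier 2: rolling summary of aged-out
    turns. Tier 3: semantic retrieval of long-term facts (injected)."""

    def __init__(self, recent_window, summarize, retrieve):
        self.recent_window = recent_window  # Tier 1 size (e.g. 4-6 turns)
        self.recent = []                    # Tier 1: verbatim turns
        self.summary = ""                   # Tier 2: compressed history
        self.summarize = summarize          # (old_summary, aged_turn) -> new summary
        self.retrieve = retrieve            # query -> list of relevant facts

    def add_turn(self, turn):
        self.recent.append(turn)
        if len(self.recent) > self.recent_window:
            aged = self.recent.pop(0)       # oldest turn ages out of Tier 1...
            self.summary = self.summarize(self.summary, aged)  # ...into Tier 2

    def build_context(self, query):
        parts = []
        facts = self.retrieve(query)        # Tier 3: only query-relevant facts
        if facts:
            parts.append("Known facts: " + "; ".join(facts))
        if self.summary:
            parts.append("Earlier conversation (summary): " + self.summary)
        parts.extend(self.recent)           # Tier 1: verbatim recent turns
        parts.append(query)
        return "\n".join(parts)
```

The transcript never grows without bound: each request sees a fixed-size verbatim window, one summary, and a handful of retrieved facts.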
Tool Spotlight: Mem0
Mem0 (mem0.ai) is an open-source memory layer built specifically for agents. It handles the tiered memory pattern above with a one-line integration.
The core mechanism: Mem0 intercepts conversation history, extracts semantically important facts, and stores them as a compressed, retrievable memory graph. On subsequent requests, it retrieves only the memories relevant to the current query and injects them into context. The raw conversation history doesn't accumulate.
Their benchmarks claim 80% reduction in prompt tokens, 26% higher response quality compared to OpenAI's built-in memory, and 91% faster response times versus full-context approaches. Treat these as directionally useful rather than independently verified.
Works with LangGraph, CrewAI, AutoGen, and direct API calls. The graph-based variant stores memories as a directed labeled graph for applications that need relational reasoning across memories, not just semantic similarity.
The practical bar for adopting Mem0: if your agent is hitting context limits on long conversations, or you're building anything with persistent user state, it's the most efficient path to production-quality memory management without building the plumbing yourself.
Quick Hits
The "lost-in-the-middle" problem now has a name and a benchmark suite. Liu et al.'s 2023 paper formalized it (arxiv.org/abs/2307.03172); Chroma's 2025 work confirmed it holds across the latest generation of models. If you're building RAG, structure your retrieved chunks with the most relevant content first and last, and test across context lengths.
LangChain released a practical context engineering guide that covers tool schema design, message formatting, and how to diagnose whether agent failures are model failures or context failures. Worth bookmarking. (blog.langchain.com/context-engineering-for-agents)
Anthropic's MCP standardizes how external tools, resources, and data sources inject information into a model's context — which means how you architect your MCP server design is itself a context engineering decision.
Structured output has reached the point where not using it in production is a choice you should justify. Every major API provider now supports typed output definitions. The performance improvement on tasks requiring consistent format is consistent enough that "we'll parse it ourselves" is hard to defend.
Builder Tip
Before you add more context, cut some first. Take your current system prompt and ask: what would break if I removed this paragraph? If the honest answer is "probably nothing," cut it. Noisy, redundant instructions consume your model's attention budget without contributing to output quality. The cleanest context windows in production are the ones where every line earns its place.
Gradient Push covers AI and automation for builders. If this was useful, pass it on.
Enjoying this? Subscribe to Gradient Push for practical AI and automation breakdowns — gradientpush.com