Context Engineering: What the Model Sees, and Who Decides

Avinash Hindupur


This is Part 2 of a 3-part series. Prompt Engineering covered the writing layer. This part is about Context Engineering, the assembly layer underneath. Part 3 will cover Harness Engineering, the loop around the model.

In late June 2025 a phrase started showing up everywhere. Tobi Lütke posted that he liked "context engineering" over "prompt engineering" because it described the core skill better. Andrej Karpathy amplified it the same week. Simon Willison wrote that he thought the term would stick. By September Anthropic had published a long engineering post built around it, and by March 2026 a follow-up cookbook went deep into the operational primitives. By 2026 it was the default term for a thing a lot of people had been doing without a name.

This post walks through what the term means in practice, what the main references say, and why on a real product platform context engineering is mostly not a writing problem.

The definition that stuck

Karpathy's framing in one sentence:

Context engineering is the delicate art and science of filling the context window with just the right information for the next step.

He goes on to list what "right information" means in practice: task descriptions and explanations, few-shot examples, RAG, related (possibly multimodal) data, tools, state and history, compacting. And he names the trade-off directly. Too little or of the wrong form and the LLM doesn't have the right context for optimal performance. Too much or too irrelevant and the LLM costs might go up and performance might come down.

Bigger context windows did not make context engineering less important; they made it more important. The failure mode shifted. You used to fail by not fitting enough in. You now fail by stuffing in everything you have and watching quality degrade.

Simon Willison made the case for why the term itself was useful. He had defended "prompt engineering" for years, hoping it could capture the actual difficulty of the work. It didn't. People kept hearing it as a fancy name for typing into a chatbot. His point about "context engineering" was simpler:

It turns out that inferred definitions are the ones that stick. I think the inferred definition of "context engineering" is likely to be much closer to the intended meaning.

The word "construct" recurs in his piece ("carefully and skilfully construct the right context"), and there is no longer an assumption that one human is doing the constructing in a chat box. Most context, on a real product, is built by code.

Tobi Lütke's version of the definition is the shortest one in circulation:

The art of providing all the context for the task to be plausibly solvable by the LLM.

It is the cleanest working definition of the three.

Context as a finite resource

The clearest single framing comes from Anthropic's September 2025 engineering post. They put the underlying constraint in front:

Context, therefore, must be treated as a finite resource with diminishing marginal returns. Like humans, who have limited working memory capacity, LLMs have an "attention budget" that they draw on when parsing large volumes of context. Every new token introduced depletes this budget by some amount, increasing the need to carefully curate the tokens available to the LLM.

The reason is architectural. Transformers let every token attend to every other token, which produces n² pairwise relationships for n tokens. As context length grows, the model's ability to capture those relationships gets stretched thin. Studies on needle-in-a-haystack benchmarks have given this a name: context rot. As the number of tokens in the context window increases, the model's ability to accurately recall information from that context decreases. Models also see shorter sequences far more often during training than longer ones, so they have less practice with long-range dependencies.

That gives the practical guiding principle, in Anthropic's words:

Good context engineering means finding the smallest possible set of high-signal tokens that maximize the likelihood of some desired outcome.

Every section that follows is a strategy for doing that.

The seven components

Before talking about strategies, it helps to be specific about what is actually in a "context window" on a real production turn. Philipp Schmid, formerly of Hugging Face, decomposed it into seven components.

  1. Instructions / System Prompt. The behavioural contract: persona, rules, examples, safety boundaries.
  2. User Prompt. The immediate task or question from the user.
  3. State / History (short-term memory). The current conversation: the user and model turns that led to this moment.
  4. Long-term memory. Persistent knowledge gathered across many sessions: preferences, summaries of past projects, facts the system was told to remember.
  5. Retrieved Information (RAG). External, up-to-date knowledge from documents, databases, or APIs.
  6. Available Tools. Definitions of all the functions the model can call.
  7. Structured Output. The schema the response must conform to (e.g. a JSON shape).

Schmid's own definition is worth quoting in full:

Context Engineering is the discipline of designing and building dynamic systems that provides the right information and tools, in the right format, at the right time, to give a LLM everything it needs to accomplish a task.

Three words in there are doing real work. Dynamic: built on the fly, different per request. System, not string: the output of code that runs before the main LLM call. Right format: a concise summary beats a raw dump; a clear tool schema beats a vague instruction.

Looked at this way, prompt engineering (Part 1 of this series) is mostly the art of writing components 1, 2, and 7. Context engineering is the art of deciding which other components belong in the window for this specific turn, in what shape, and within what budget.

A worked example

Schmid's piece has a small example that makes the abstract definition concrete. The same user message gets dramatically different output depending on what the system stitches together around it.

User types: "Hey, just checking if you're around for a quick sync tomorrow."

A thin pipeline sends roughly this to the model:

[system] You are a helpful assistant.
[user] Hey, just checking if you're around for a quick sync tomorrow.

The reply reads back as a generic acknowledgement. It works, in the sense that words come out, and it is useless.

A context-engineered pipeline assembles this before the call:

[system]
You are a scheduling assistant for {user_name}.
Tone: warm, brief, casual. Use the user's voice.

[long-term memory]
- Sender: Jim, longstanding partner. Informal tone is normal.

[retrieved: calendar]
- Tomorrow: back-to-back, no available windows.
- Thursday morning: free 9:00-12:00.

[available tools]
- send_invite(date, time, attendee)
- send_email(to, subject, body)

[user]
Hey, just checking if you're around for a quick sync tomorrow.

The reply now reads like the user wrote it: "Hey Jim, tomorrow's packed. Thursday AM works if that's good for you. Sent an invite." The model didn't get smarter between the two cases. The system around the model did more work before the model ever saw the message. As Schmid puts it, "agent failures aren't only model failures; they are context failures."
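
To make the assembly step concrete, here is a minimal sketch of that pipeline in Python. The helper names and message shape are illustrative assumptions, not Schmid's implementation; the stubs stand in for a real memory store and calendar API.

def lookup_contact_notes(sender: str) -> str:
    # Hypothetical stand-in for a long-term memory store.
    return f"- Sender: {sender}, longstanding partner. Informal tone is normal."

def fetch_free_busy() -> str:
    # Hypothetical stand-in for a live calendar API; fresh on every call.
    return "- Tomorrow: back-to-back, no free windows.\n- Thursday: free 9:00-12:00."

def build_context(user_name: str, sender: str, user_message: str) -> list[dict]:
    # Everything here runs before the model call, and every
    # component can differ per request.
    system = (
        f"You are a scheduling assistant for {user_name}.\n"
        "Tone: warm, brief, casual. Use the user's voice.\n\n"
        f"[long-term memory]\n{lookup_contact_notes(sender)}\n\n"
        f"[retrieved: calendar]\n{fetch_free_busy()}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_message},
    ]

messages = build_context("Ada", "Jim", "Hey, just checking if you're around for a quick sync tomorrow.")

The thin pipeline is simply what you get when build_context collapses to a constant string.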

The anatomy of effective context

Anthropic walks through the components most production teams have to think about and gives concrete guidance on each.

System prompts. They argue for what they call the right altitude: specific enough to guide behaviour, flexible enough to leave the model strong heuristics. Two failure modes to avoid. One is hardcoded if/else logic that tries to script every edge case (brittle, hard to maintain). The other is vague high-level guidance that assumes shared context (under-specified, model has to guess). They recommend organising prompts into clear sections (<background_information>, <instructions>, ## Tool guidance, ## Output description) and starting with the minimal version that works, then adding only what failure modes prove necessary.
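
A skeleton of that sectioned structure, with placeholder content; the section names are the ones Anthropic suggests, the wording inside them is illustrative:

<background_information>
One short paragraph of stable facts the model should always know.
</background_information>

<instructions>
The behavioural contract: what to always do, what never to do.
</instructions>

## Tool guidance
When to prefer which tool, a line or two each.

## Output description
The shape the answer must take.

Start from something this small and let observed failures, not anticipation, justify each addition.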

Tools. Tools are the contract between an agent and its information/action space, so they should be self-contained, robust to error, and unambiguous. The most common failure mode they call out is bloated tool sets that cover too much functionality or have overlapping decision points. Their rule of thumb: "if a human engineer can't definitively say which tool should be used in a given situation, an AI agent can't be expected to do better." A small, well-curated set beats a large one.
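
As one concrete instance of an unambiguous, narrowly scoped tool, here is a made-up definition in the Anthropic Messages API tool format:

SEARCH_ORDERS_TOOL = {
    "name": "search_orders",
    "description": (
        "Search the current customer's order history by keyword or "
        "date range. Use this for questions about past purchases. "
        "Do not use it for refunds; use issue_refund instead."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Free-text search terms."},
            "since": {"type": "string", "description": "ISO date; only return orders placed after this."},
        },
        "required": ["query"],
    },
}

The description does double duty: it says what the tool does and where its boundary with a neighbouring tool sits.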

Examples. Few-shot prompting still works, but stuffing a laundry list of edge cases into the prompt does not. Curate a small set of diverse, canonical examples that portray the expected behaviour, rather than trying to enumerate every rule.

The shared thread across all three: be informative, but tight. Every token has an opportunity cost.

Just-in-time retrieval

A lot of the early "AI-native" stack assumed you would embed everything you might need, search ahead of time, and stuff the top results into the prompt. Anthropic describes a shift away from that toward what they call just-in-time context.

Rather than pre-processing all relevant data up front, a just-in-time agent maintains lightweight identifiers (file paths, stored queries, web links) and uses tools to load data into context only when needed. Claude Code is the canonical example: it doesn't pre-embed your repository. It runs glob and grep and reads files on demand, the same way a human navigates a codebase. The metadata it sees along the way (folder names, file naming conventions, timestamps) is itself signal about what is relevant.

This gives you progressive disclosure: the agent assembles understanding layer by layer, keeping only what is needed in working memory and using note-taking when something needs to persist. The trade-off, which Anthropic names directly, is that runtime exploration is slower than a pre-computed lookup. The right answer is often hybrid. Claude Code drops CLAUDE.md files into context up front, then uses on-demand tools for the rest.
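
A sketch of the two halves of that pattern, assuming a simple file-backed corpus; the point is that the context holds identifiers, not contents, until a tool call asks for them:

from pathlib import Path

def list_identifiers(root: str) -> str:
    # Cheap: only paths and sizes enter the context. The metadata
    # (names, folders, sizes) is itself signal about relevance.
    return "\n".join(
        f"{p} ({p.stat().st_size} bytes)" for p in sorted(Path(root).rglob("*.md"))
    )

def read_document(path: str, max_chars: int = 4_000) -> str:
    # Expensive: called only when the agent decides a file matters.
    # Truncation keeps any single read inside a predictable budget.
    return Path(path).read_text()[:max_chars]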

For more dynamic content (a live codebase, a working calendar, a constantly changing project), just-in-time wins on freshness and on token efficiency. For static and well-bounded content (legal corpora, reference docs), pre-loaded retrieval still has its place. The choice is workload-specific.

Three primitives for long-running agents

A single-turn chatbot has one shot to assemble the context. An agent runs for many turns, each producing output that becomes input to the next turn. Anthropic's March 2026 cookbook frames the operational problem this way:

A common challenge when building long-horizon agents is managing context. Tool results, the model's own reasoning, and user messages all accumulate, and eventually you either hit the token limit or start paying for context that isn't helping anymore.

They focus on three primitives that can be combined or used independently. Each addresses a different shape of the problem.

Compaction. Take a conversation that is approaching the context-window limit, summarise it, and continue from the summary. Compaction is a whole-transcript operation. It flattens user messages, assistant turns, tool calls, tool results, even prior compaction blocks into a single condensed summary. The skill is in choosing what to keep: high-level facts and decisions usually survive, obscure specifics from deep in tool outputs usually don't. Compaction handles all kinds of context growth, but it costs an inference call and is lossy by design.
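
A minimal compaction sketch, assuming a generic call_model() helper and a crude character-based token estimate; a real implementation would tune the summary prompt carefully, since the operation is lossy:

COMPACTION_PROMPT = (
    "Summarise the conversation so far for your own future use. "
    "Keep decisions, facts, open questions, and any file paths or "
    "URLs. Drop raw tool output."
)

def call_model(messages: list[dict]) -> str:
    raise NotImplementedError  # placeholder for your LLM client

def estimate_tokens(messages: list[dict]) -> int:
    # Rough heuristic: about four characters per token.
    return sum(len(str(m["content"])) for m in messages) // 4

def maybe_compact(messages: list[dict], limit: int = 150_000) -> list[dict]:
    if estimate_tokens(messages) < limit:
        return messages
    # One inference call flattens the whole transcript...
    summary = call_model(messages + [{"role": "user", "content": COMPACTION_PROMPT}])
    # ...and the next turn continues from the condensed version.
    return [{"role": "user", "content": f"[conversation summary]\n{summary}"}]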

Tool-result clearing. Walk the message list and replace the content of old tool_result blocks with a short placeholder, while leaving the matching tool_use record in place. The model still knows the call was made and what arguments were used; the bulky payload is gone. This is the cheapest of the three: no inference cost, just a mechanical edit. It works well for re-fetchable data (file reads, API queries, search results) where the agent rarely needs to see the old raw content again, and can simply call the tool again if it does.
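
Clearing is mechanical enough to show almost in full. A sketch against the Anthropic message shape, where tool results arrive as tool_result blocks inside user-role messages; the keep_last threshold is an assumption of this sketch:

PLACEHOLDER = "[tool result cleared; call the tool again if you need it]"

def clear_old_tool_results(messages: list[dict], keep_last: int = 3) -> list[dict]:
    # Walk newest to oldest, sparing the most recent results. The
    # matching tool_use blocks stay untouched, so the model still
    # sees which calls were made and with what arguments.
    seen = 0
    for message in reversed(messages):
        if not isinstance(message.get("content"), list):
            continue
        for block in message["content"]:
            if block.get("type") == "tool_result":
                seen += 1
                if seen > keep_last:
                    block["content"] = PLACEHOLDER
    return messages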

Memory. Structured note-taking the agent does itself, persisted outside the context window. Notes get pulled back in on demand, and they survive across sessions. Anthropic ships this as a first-party tool whose backend you implement (file system, database, blob store, your call). The agent decides what to write and when to read, and a default protocol is auto-injected so it always checks /memories first. Memory is lossless on whatever the agent chose to save, and is the only one of the three that addresses the cross-session problem.
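
The backend is yours to implement; a minimal file-system version might look like the sketch below. The /memories convention comes from the cookbook's description; everything else here is an assumption.

from pathlib import Path

MEMORY_ROOT = Path("memories")

def memory_write(relative_path: str, content: str) -> str:
    # The agent decides what is worth saving; we just persist it.
    target = MEMORY_ROOT / relative_path
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content)
    return f"Saved {relative_path}"

def memory_read(relative_path: str = "") -> str:
    # Reading the root lists available notes; reading a file returns
    # it. Unlike the context window, this survives restarts.
    target = MEMORY_ROOT / relative_path
    if target.is_dir():
        return "\n".join(
            str(p.relative_to(MEMORY_ROOT)) for p in sorted(target.rglob("*")) if p.is_file()
        )
    return target.read_text()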

A short mental model from the same post: "compaction compresses the whole window when it grows too large, clearing drops stale re-fetchable data inside the window, and memory moves information out of the window so it survives across sessions."

The Anthropic team illustrates this with a research agent reading eight ~40K-token review documents. Without management, the context climbs past 330K tokens by the end of a single run. With clearing alone, the peak stays at 173K. With all three primitives layered (clearing trigger, compaction trigger, memory tool active throughout), peak stays at 170K and the final context drops to under 14K because clearing keeps firing as new tool results pile up. The real point isn't the numbers. It's that each primitive targets a different bottleneck, and "which one do I need?" depends on which bottleneck your workload actually has.

Sub-agent isolation

Anthropic's earlier framing adds a fourth pattern that doesn't quite fit the three above: instead of managing one large context, split the work across specialised sub-agents that each get their own clean window. The lead agent coordinates with a high-level plan; sub-agents do focused work and return only condensed summaries (often 1,000 to 2,000 tokens of output from tens of thousands of tokens of internal exploration). They reported substantial gains over single-agent systems on complex research tasks, at the cost of using up to 15x more tokens overall.
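
In code, the isolation is simply a fresh message list per sub-task, with only the condensed result crossing back; agent_loop stands in for the full model/tool loop (Part 3's subject), and the summary budget is illustrative:

def agent_loop(messages: list[dict]) -> str:
    raise NotImplementedError  # the model/tool-use loop lives here

def run_subagent(task: str, summary_budget_chars: int = 8_000) -> str:
    # A clean window: the sub-agent sees its task, not the lead
    # agent's history. Its exploration may burn tens of thousands
    # of tokens that the lead agent never pays for in context.
    messages = [
        {"role": "system", "content": "Do the task below. Return only a concise summary of findings."},
        {"role": "user", "content": task},
    ]
    return agent_loop(messages)[:summary_budget_chars]

def lead_agent(plan: list[str]) -> list[str]:
    # The lead agent's context grows by one summary per sub-task.
    return [run_subagent(task) for task in plan]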

Cognition disagrees, and both views are worth holding in mind. Their position, from Don't Build Multi-Agents:

Context engineering is effectively the #1 job of engineers building AI agents.

Their argument is that multi-agent systems are fragile precisely because context-sharing between agents is hard, and any system that splits context across processes pays an integration tax. Both teams are right about their own systems. The honest read is that multi-agent isolation is a powerful tool when the sub-tasks really are independent (broad research, parallel exploration), and a footgun when the sub-tasks need to share state (most coding work, most multi-step product workflows).

The failure modes are now well-named

Drew Breunig catalogued four, and most production teams have hit all of them:

  • Context Poisoning, where a hallucination earlier in the trajectory makes it into the context and gets treated as ground truth on later turns.
  • Context Distraction, where the volume of context overwhelms what the model learned in training and it starts pattern-matching to what is in front of it.
  • Context Confusion, where superfluous context (extra retrieved chunks, unused tool descriptions) influences the response in ways no one wanted.
  • Context Clash, where two parts of the context disagree (the system prompt says one thing, a retrieved doc says another) and the model picks unpredictably.

All four are about what is in the window, not how nicely the user phrased the request. They are problems you fix in the assembly layer.

Why this stops being a writing problem

There is a useful test for whether you are doing prompt engineering or context engineering. Ask: for two consecutive requests from the same user, will the contents of the context window be different?

If no, it is a prompt. If yes, something is choosing what changes, and that something is an engineering system.

Some of what that system has to do, drawn across the references:

  1. Decide what to retrieve. Not every question needs every document. Real systems combine semantic search with keyword search, recency filters, source-of-truth rules, and hard scopes (this user, this workspace, this project). Retrieve too narrowly and the model invents an answer. Retrieve too widely and the relevant snippet gets buried.

  2. Decide what to remember. Conversation history grows linearly, context windows do not in any practical sense. Compaction, tool-result clearing, and persistent memory are the three levers here.

  3. Decide what tools to expose. Every tool description costs tokens, and a model with thirty available tools chooses worse than the same model with the right four. Production systems often route through a smaller classifier first to decide which subset of tools belong in a given turn.

  4. Decide what to inject silently. A logged-in user has a workspace, a role, a set of integrations, an onboarding profile. None of that should be retyped each turn. The platform fetches it and stitches it into the system prompt.

  5. Decide what to evict. When you exceed the budget, something has to go. The order in which you drop things (older turns first, large tool outputs first, lower-relevance retrieved chunks first) is a design choice with quality consequences; a sketch follows after this list.

These are all code paths, not prompts.
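
As one illustration of item 5, here is a sketch of a priority-ordered eviction pass; the kind tags and per-item token counts are assumptions of this sketch:

from dataclasses import dataclass

@dataclass
class ContextItem:
    kind: str   # "system", "turn", "tool_output", or "retrieved"
    tokens: int
    text: str

# The drop order is the design choice: stale tool output first, then
# retrieved chunks, then the oldest turns. Never the system prompt.
EVICTION_ORDER = ["tool_output", "retrieved", "turn"]

def evict_to_budget(items: list[ContextItem], budget: int) -> list[ContextItem]:
    total = sum(item.tokens for item in items)
    for kind in EVICTION_ORDER:
        # List order is age, so the oldest items of each kind go first.
        for item in [i for i in items if i.kind == kind]:
            if total <= budget:
                return items
            items.remove(item)
            total -= item.tokens
    return items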

Who controls what (revisited)

In Part 1 we drew the line between the user message (which you write) and the system instruction, context, and harness (which the platform writes). Context engineering is the middle piece of that drawing.

If you are typing into a raw chatbot, you do most of your context engineering by hand. You paste documents into the message, repeat relevant facts each turn, attach files. It works for short tasks and falls over for long ones, because you become the operating system manually paging things in and out.

If you are using a product platform built for a job (a coding assistant, a customer support agent, a PM workspace), the platform should be doing the context engineering for you. It should know your workspace, fetch the right documents, remember decisions across sessions, route to the right specialised agent, evict stale state, and never make you re-explain who you are. At Idam AI this looks like onboarding data injection, a lightweight cross-session index that summarises past sessions in around 150 tokens, and on-demand expansion of any one of those sessions into the current agent's context. Different problem domain, same primitives.

Put another way: prompt engineering is what you can do from outside the system, and context engineering is what only the system can do. The further AI products move from chat-with-a-model toward jobs-to-be-done, the more of the work moves into this middle layer.

A short checklist

If you are building or evaluating a system that calls an LLM, run through these:

  • Are you assembling context per-turn, or sending the same prompt every time?
  • Are stable components (system prompt, tool descriptions) at the top of the prompt and cacheable?
  • Do you have a strategy for what gets retrieved, and how it is ranked and trimmed?
  • Do you have a strategy for what happens when context approaches the budget? (Compaction? Tool-result clearing? Both?)
  • Do you have persistent memory for facts that should survive a session reset?
  • Do you only expose the tools the model is likely to need for this turn?
  • Do you inject what you already know about the user (role, workspace, preferences) without making them retype it?
  • Can you log and replay the exact context the model saw on any given request?

The last item is the most important in practice. When a model gets something wrong, the first question is rarely "was the prompt bad?" It is "what did the model actually see?" If you cannot answer that exactly, you cannot fix it.
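
A minimal version of that logging, assuming a JSON-lines sink; given the request id, replay is just reading the record back and re-sending the same messages:

import json, time, uuid

def log_context(messages: list[dict], model: str, path: str = "context_log.jsonl") -> str:
    # Persist exactly what the model saw, keyed by request id, so
    # any response can later be replayed against the same input.
    request_id = str(uuid.uuid4())
    record = {"request_id": request_id, "ts": time.time(), "model": model, "messages": messages}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return request_id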

Next up: Harness Engineering. The orchestration layer: how production systems decide when to call the model, when to call a tool, when to retry, and when to stop.
