Harness Engineering: Everything Around the Model

Avinash Hindupur


This is Part 3 of a 3-part series. Prompt Engineering covered the writing layer. Context Engineering covered the assembly layer. This one covers Harness Engineering, the loop around the model and everything in the agent that isn't the model itself.

The phrase "harness engineering" wasn't common vocabulary in mid-2025. Mitchell Hashimoto used it in a personal blog post in early February 2026, OpenAI used it as the title of a case-study post a week later, and Anthropic published a deep dive on harness design in March. By April most teams building on coding agents had adopted the term. This post is about what it means in practice and the patterns that have started to converge.

What is a harness?

A useful working definition comes from Hashimoto's February 2026 post:

Anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again.

A common shorthand built on top of it: Agent = Model + Harness. The harness is everything else. The system prompts and AGENTS.md files. The tools and MCP servers. The sandbox the code runs in. The hooks that block destructive commands. The compaction policy. The reviewer subagent. The orchestration code that decides when to retry, when to hand off, when to stop.
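
To make "everything else" concrete, here is a minimal sketch, in Python, of the kind of loop a harness owns. The Harness type and its callables are hypothetical stand-ins, not any vendor's API; the point is only to show where each layer (hooks, sandboxed tools, compaction, the stop policy) plugs in around the model call.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical, minimal shapes; a real harness has far richer types.

@dataclass
class ToolCall:
    name: str
    args: dict

@dataclass
class ModelReply:
    text: str = ""
    tool_call: ToolCall | None = None

@dataclass
class Harness:
    call_model: Callable[[list[dict]], ModelReply]  # the model: one input among many
    run_tool: Callable[[ToolCall], str]             # sandboxed execution environment
    is_destructive: Callable[[ToolCall], bool]      # hook: block rm -rf, force pushes, ...
    compact: Callable[[list[dict]], list[dict]]     # context machinery: compaction policy
    max_turns: int = 50                             # orchestration: when to stop

def run_agent(h: Harness, system_prompt: str, task: str) -> str:
    messages = [{"role": "system", "content": system_prompt},
                {"role": "user", "content": task}]
    for _ in range(h.max_turns):
        reply = h.call_model(messages)
        if reply.tool_call is None:
            return reply.text                       # the model decided it is done
        if h.is_destructive(reply.tool_call):
            messages.append({"role": "tool",
                             "content": "Blocked by policy. Propose a safer command."})
            continue
        messages.append({"role": "tool", "content": h.run_tool(reply.tool_call)})
        messages = h.compact(messages)
    return "Stopped: turn budget exhausted."
```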

Addy Osmani puts the practical implication directly: "A decent model with a great harness beats a great model with a bad harness." There's a striking data point that shows up in a few places. On Terminal Bench 2.0, the same Claude Opus 4.6 model scored quite differently inside Claude Code versus a custom harness, and one team reported moving a coding agent from outside the Top 30 to inside the Top 5 by changing only the harness. Most of the public conversation in 2024 and 2025 was about which model to pick. Harness engineering is the recognition that the model is one input to a running agent, and the rest deserve the same level of design.

What's in a harness

Drawing on these references, a typical coding-agent harness today contains the following layers. Different products mix and match, and a non-coding agent (a customer support bot, a PM workspace) will look different in detail, but the surface area is the same.

  • Configuration: system prompts, AGENTS.md, CLAUDE.md, skill files, subagent prompts
  • Tools: Bash, file edit, grep, MCP servers, custom tools, browser automation
  • Execution environment: sandbox, isolated worktrees, language runtimes, headless browsers, observability stack
  • Context machinery: compaction, tool-result clearing, persistent memory, context resets, just-in-time retrieval
  • Orchestration: subagent spawning, planner/generator/evaluator splits, model routing, handoffs
  • Hooks and middleware: pre-commit checks, destructive-action blockers, lint and typecheck on edit, approval gates
  • Observability: logs, traces, cost and latency metering, replay

None of this lives in the model's weights. It is all code, configuration and operating policy that you (or your platform vendor) write and maintain. Claude Code, Cursor, Codex and Aider are running similar models underneath, and the behaviour you experience is dominated by what each harness does around them.

A million lines of agent-generated code

OpenAI's Codex case study is the most useful published reference for what a fully built-out harness looks like in production. A team of three engineers (later seven) shipped roughly a million lines of code across about 1,500 PRs in five months, with a self-imposed rule of zero manually-written code. The rule was a forcing function. The job of the engineers became designing the environment instead of writing in it.

Some of the load-bearing pieces in their harness:

  • Repository knowledge as the system of record. A short AGENTS.md (around 100 lines) acts as a table of contents, with a structured docs/ directory underneath holding architecture documentation, design docs, execution plans, product specs and reference materials. Anything the agent cannot see in the repo, in their framing, does not exist. A Slack thread that aligned the team on a design decision is invisible to Codex unless someone encodes it as markdown in the repo.
  • Custom linters and structural tests that enforce a strict layered architecture (Types → Config → Repo → Service → Runtime → UI within each business domain, with cross-cutting concerns entering only through a Providers interface). The linter error messages are written specifically to tell the agent how to fix the violation, not just to flag it; a sketch of this pattern follows the list.
  • Chrome DevTools Protocol exposed to the agent, so Codex can drive the running app in its own worktree, reproduce bugs, and validate fixes by clicking through the UI.
  • A local observability stack per worktree (logs, metrics, traces) queryable by the agent with LogQL and PromQL. Prompts like "ensure service startup completes in under 800ms" become tractable because the sensor that measures startup is wired into the harness.
  • Background agents on a schedule. A doc-gardening agent scans for stale documentation and opens fix-up PRs. A "garbage collection" pass finds deviations from a set of golden principles, updates quality grades and opens targeted refactoring PRs, most reviewable in under a minute and automergeable.
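
As an illustration of the fix-it-oriented linter mentioned in the list above, here is a minimal sketch. The layer names follow the post; the app.<layer> import convention, the allowed dependency direction and the wording of the message are assumptions for illustration, not the Codex team's actual tooling.

```python
# Hypothetical structural lint: the point is that the error message tells the
# agent how to fix the violation, not just that one exists.
import re
import sys
from pathlib import Path

# Assumed rule: earlier layers may not import later ones.
LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]

def layer_of(path: Path) -> int | None:
    for i, layer in enumerate(LAYERS):
        if layer in path.parts:
            return i
    return None

def check_file(path: Path) -> list[str]:
    errors = []
    src_layer = layer_of(path)
    if src_layer is None:
        return errors
    for lineno, line in enumerate(path.read_text().splitlines(), 1):
        m = re.match(r"\s*from\s+app\.(\w+)", line)  # assumes `from app.<layer>` imports
        if not m:
            continue
        target = m.group(1)
        if target in LAYERS and LAYERS.index(target) > src_layer:
            errors.append(
                f"{path}:{lineno}: the {LAYERS[src_layer]} layer may not import {target}. "
                f"Fix: route this dependency through the Providers interface, or move "
                f"the code that needs it into the {target} layer."
            )
    return errors

if __name__ == "__main__":
    problems = [e for f in sys.argv[1:] for e in check_file(Path(f))]
    print("\n".join(problems))
    sys.exit(1 if problems else 0)
```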

Their summary of the new bottleneck:

Our most difficult challenges now center on designing environments, feedback loops, and control systems that help agents accomplish our goal: build and maintain complex, reliable software at scale.

The operating principle they keep returning to is to treat agent struggle as a signal. Identify what is missing (tools, guardrails, documentation) and feed it back into the repository, having Codex itself write the fix.

Patterns for long-running work

Anthropic's March 2026 post on harness design covers a different shape of the problem: agents that have to work coherently for hours on a single task. Two failure modes recur.

Coherence loss as the context fills. Models drift, lose track, and in some cases exhibit what the team calls "context anxiety" where they start wrapping up work prematurely as they approach what they believe is their context limit. Compaction (summarising in place) helps but doesn't always cure it. The harness response is context resets: tear the session down entirely and start a fresh agent, with a structured handoff artifact that carries the previous agent's state and the next steps. Costs more orchestration and tokens, but cures the anxiety. Once Opus 4.6 reduced the underlying behaviour, context resets could be dropped from later versions of the harness.
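
A minimal sketch of what a context-reset policy with a structured handoff could look like. The Handoff fields, the run_session callable and the reset budget are illustrative assumptions, not Anthropic's implementation.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable

@dataclass
class Handoff:
    goal: str              # the original task, restated
    completed: list[str]   # what the previous agent finished
    in_progress: str       # where it stopped, in its own words
    next_steps: list[str]  # concrete next actions
    gotchas: list[str]     # constraints discovered along the way

def run_with_resets(goal: str,
                    run_session: Callable[[str], tuple[Handoff | None, bool]],
                    max_resets: int = 5) -> None:
    """run_session takes a prompt and returns (handoff, done)."""
    prompt = goal
    for _ in range(max_resets):
        handoff, done = run_session(prompt)
        if done:
            return
        if handoff is None:
            prompt = goal                    # nothing to carry forward; start clean
            continue
        # Fresh session, fresh context window: only the artifact carries state across.
        prompt = (
            "You are continuing another agent's work.\n"
            "Handoff:\n" + json.dumps(asdict(handoff), indent=2) + "\n"
            "Pick up from 'in_progress' and work through 'next_steps'."
        )
```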

Self-evaluation skews positive. When asked to grade its own work, an agent reliably tells you it did well, especially on subjective tasks like design where there is no binary check. The fix here is structural. Separate the agent doing the work from the agent judging it, and tune the evaluator to be skeptical. The evaluator is still an LLM and still inclined to be generous towards LLM outputs, but tuning a standalone evaluator turns out to be much more tractable than making a generator critical of its own work.

These two ideas combine into a planner / generator / evaluator architecture. A planner expands a one-line user prompt into a full spec. A generator implements the spec, optionally one feature at a time. An evaluator runs the result, often through a Playwright MCP that drives the live application, and grades it against agreed criteria. Before each chunk of work, the generator and evaluator negotiate a sprint contract, agreeing on what "done" looks like and what tests will verify it, then the generator builds against the contract.
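
A sketch of that planner / generator / evaluator split with a per-sprint contract, under the assumption that each role is its own agent behind a simple callable. The contract fields and the retry budget are illustrative.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SprintContract:
    feature: str
    done_criteria: list[str]   # what "done" looks like, agreed before work starts
    checks: list[str]          # the probes the evaluator will actually run

def build(spec_prompt: str,
          plan: Callable[[str], list[str]],                  # planner: one line -> feature list
          negotiate: Callable[[str], SprintContract],        # generator + evaluator agree on "done"
          generate: Callable[[SprintContract, list[str]], None],
          evaluate: Callable[[SprintContract], list[str]],   # returns failed checks
          max_attempts: int = 3) -> None:
    for feature in plan(spec_prompt):        # planner expands a one-line prompt into features
        contract = negotiate(feature)        # the sprint contract for this chunk of work
        feedback: list[str] = []
        for _ in range(max_attempts):
            generate(contract, feedback)     # implement, with previous failures injected verbatim
            feedback = evaluate(contract)    # e.g. drive the live app via a Playwright MCP
            if not feedback:
                break                        # contract satisfied; move to the next feature
```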

A few details from the runs are worth carrying forward. Grading criteria steer the generator more than the team initially expected: wording like "the best designs are museum quality" measurably pushed the generator toward more distinctive aesthetic territory before any feedback loop kicked in. The right number of agents shifts with the model: on Sonnet 4.5 the sprint construct and per-sprint evaluator were load-bearing for keeping work coherent, but on Opus 4.6 the team could drop the sprint construct and run the generator as one continuous session, with the evaluator only invoked when the task sat at the edge of what the model could do reliably solo. Simplifying the harness is also harder than building it. The first instinct was to cut everything back radically, but that made it hard to tell which pieces were load-bearing. The discipline that worked was removing one component at a time and reading the impact on the output.

The principle from the same post that is the most useful one to hold onto:

Every component in a harness encodes an assumption about what the model can't do on its own.

When the model gets better at something, the corresponding component should come out. When the model unlocks a new ceiling, new scaffolding is needed to reach it.

Mistakes turn into rules

A harness mostly accretes from observed failure rather than getting designed up front. The agent runs the wrong test command, so a line goes into AGENTS.md. The agent ships a PR with a commented-out test, so a pre-commit hook greps for .skip( and xit( in the diff, and a reviewer subagent flags commented-out tests as a blocker. The agent gets lost in a 40-step task, so the work is split into a planner and an executor. The agent keeps "finishing" broken code, so a typecheck back-pressure signal gets wired into the loop, injecting the failure into the next iteration.
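
As a concrete example of one mistake turning into one rule, here is a sketch of a pre-commit hook that blocks skipped tests. The .skip( and xit( patterns come from the example above; the hook being a Python script and the message wording are assumptions.

```python
#!/usr/bin/env python3
# Block commits whose diff adds skipped tests; tell the agent how to fix it.
import re
import subprocess
import sys

BANNED = [r"\.skip\(", r"\bxit\("]

def main() -> int:
    diff = subprocess.run(
        ["git", "diff", "--cached", "--unified=0"],
        capture_output=True, text=True, check=True,
    ).stdout
    added = [l[1:] for l in diff.splitlines()
             if l.startswith("+") and not l.startswith("+++")]
    hits = [l.strip() for l in added if any(re.search(p, l) for p in BANNED)]
    if hits:
        print("Commit blocked: this diff adds skipped tests:")
        for h in hits:
            print(f"  {h}")
        print("Fix: make the test pass and remove .skip/xit, or delete the test and "
              "explain why in the PR description. Do not commit it disabled.")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```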

Earn each line: every rule should trace to a specific past failure or a hard external constraint, because rules that don't are noise, and noise dilutes the rules that matter (some teams cap their AGENTS.md at well under a hundred lines for exactly this reason). Make success silent and failures verbose: if a lint passes the agent hears nothing, but if it fails the error message is engineered to be injected into the loop and tell the agent what to do about it. The feedback loop is close to free in the common case and directly actionable when something goes wrong.
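
And a sketch of the "silent on success, verbose on failure" shape for a post-edit hook, assuming a TypeScript project checked with tsc; the invocation and the message format are illustrative, not any particular product's hook API.

```python
import subprocess

def typecheck_feedback() -> str | None:
    """Return None when the check passes (the agent hears nothing) or an
    actionable message when it fails (injected into the agent's next turn)."""
    result = subprocess.run(["npx", "tsc", "--noEmit"], capture_output=True, text=True)
    if result.returncode == 0:
        return None                            # success stays silent
    return (
        "Typecheck failed after your last edit. Fix these errors before continuing; "
        "do not mark the task complete while they remain:\n"
        + result.stdout + result.stderr
    )
```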

A useful reframe is that an agent failure is usually a configuration problem before it's a model problem. The reflex of "wait for the next model" leaves the leverage on the floor. Asking "what configuration would have prevented this" turns the failure into a permanent improvement.

Better models don't make the harness redundant

A natural worry, especially for anyone investing in scaffolding today, is that the next model will obsolete the harness. The pattern observed so far is that the harness keeps moving rather than shrinking.

Anthropic's experience is the cleanest evidence. Opus 4.6 effectively eliminated the context-anxiety failure mode that Sonnet 4.5 exhibited, which meant a whole class of context-reset scaffolding could be retired. At the same time, the longer coherent runs that 4.6 made possible introduced new failure modes (drift over multi-hour builds, subtle bugs in deeply nested features that the evaluator wasn't probing) that needed their own scaffolding.

There's also a feedback loop running in the other direction. Today's agent products are post-trained with their harnesses in the loop. Models get specifically better at the operations the harness designers think they should be good at: filesystem operations, planning, subagent dispatch, tool use through bash. That's why a model can feel different inside two different harnesses, and why moving a model from the harness it was trained inside to a different one can either unlock latent capability or surface new failure modes.

One longer-term direction is the idea of harness templates. Most engineering organisations already have two or three main service topologies (a CRUD business service, an event processor, a data dashboard). In a few years those templates may come pre-bundled with their own custom linters, structural tests, agent skill files and observability hooks, all matched to the topology and the tech stack. Picking a stack would partly mean picking which harness template you can stand on.

Who controls what (one final time)

This series started with a small drawing of who controls which layer. Worth closing it out with the full picture.

The user message is what you write. The system instruction is set by whoever built the platform. The context (history, retrieved documents, project files, tool outputs, your onboarding profile) is triggered by you and assembled by the platform. On a raw model API it might be as little as a project folder; on a product platform it becomes an engineering discipline of its own.

The harness is almost entirely platform code. It decides when to call the model, when to call a tool, when to retry, when to hand off to another agent, when to ask you for confirmation, when to stop. You experience it as the difference between a chatbot that answers questions and a system that gets work done.

When a product moves from "wrap the model" to "run agents," the work shifts down through the stack from prompts to context to harness, and the proportion of value created inside the platform grows at every step. At Idam AI this looks concretely like a callbacks pipeline that runs guardrails, cache lookups, context injection and logging on every turn; agent-specific planners that vary by task; an MCP toolset whose lifecycle is managed for the user; and SSE streaming that keeps long agent runs feeling responsive. None of that is visible in a user message, but all of it shapes the output.

A short checklist

If you are building or evaluating a system that wraps an LLM, run through these:

  • Can you name, for every component of your harness, the model failure it exists to address?
  • When you remove a component, can you tell whether it was load-bearing?
  • Are sensor signals (lint errors, test failures, review comments) written to be consumed by the model, not just read by humans?
  • Do you have a clear policy for context growth: compaction, tool-result clearing, persistent memory, full resets, or some combination?
  • When the agent produces output that should be evaluated, is the evaluator a separate agent with its own prompt and (if helpful) tools?
  • Are your most frequent agent mistakes making it into your AGENTS.md, your hooks, or your reviewer prompts, or are they being retried and forgotten?
  • When a new model lands, do you re-examine the harness, removing scaffolding that is no longer load-bearing and adding scaffolding for the new ceiling?
  • Can you replay any agent run end-to-end, with the full state the agent saw at every step?

Almost every meaningful improvement to a harness starts with reading a real failure trace and asking what would have caught it.

That closes out the series. The same primitives (clarity, examples, structure, feedback) keep showing up at each layer; what changes is who owns them. If you're building a product on top of an LLM, the centre of gravity moves from prompt to context to harness as the product matures.
