What is an agent harness?
This is the post I've been putting off writing, not because I don't know what to say but because I've been trying to figure out where to start. The concept of a harness sounds like it should be simple to explain, but every time I've tried to define it for someone, I find myself reaching for analogies before I reach for definitions, which is usually a sign that the thing is best understood through contrast rather than description.
So let me start there.
Open Claude Code and type "fix the failing test in the login page." A few minutes later, the test passes, two files have changed, and you didn't touch the keyboard. Then imagine opening a chat window with the same underlying model (same training, same weights, same intelligence) and typing the exact same thing. What comes back is a thoughtful paragraph about what might be causing the failure. Maybe some code you'd have to copy somewhere. Maybe a suggestion to check a particular file.
Same model. The results look nothing alike.
The difference is not the model. The difference is the system wrapped around it.
A harness is everything the model can't provide for itself
A language model, at its core, is a function that takes text in and produces text out. It has learned a great deal about the world, about code, about language and reasoning, from the data it was trained on, but it cannot act on any of that knowledge without infrastructure that turns its outputs into actions. It can describe how to fix a test. It cannot run the test.
A harness is that infrastructure. It is the loop that keeps the model running until a task is complete. It is the tools (reading files, writing code, executing commands) that let the model's outputs become effects in the world. It is the context pipeline that decides what the model sees on every turn, and what gets dropped when the memory fills up. It is the verification that checks whether an action actually worked. It is the recovery logic that decides what to do when it didn't.
Without the harness, even a highly capable model is essentially passive. With a good harness, it becomes something that ships work.
I like to think of it the way I think about a chef and their kitchen. The best chef in the world, dropped into an empty room, ships nothing. Put a merely competent cook in a well-designed kitchen (sharp knives within reach, ingredients prepped, a sous chef checking temperatures, a system for getting orders to the pass in the right sequence) and plates go out. The cook's skill matters. But the kitchen is often what actually determines how much gets done.
The loop at the center of everything
The core of every agent harness is a loop. Not merely a loop in the programming sense (though it is implemented as one), but a work loop, the kind of iterative process any craftsperson runs without thinking about it: try something, see what happened, decide what to do next, repeat until done.
One turn of that loop, in a harness like Claude Code, looks like this:
1. Assemble context: gather the task, the conversation so far, the results of the last tool call, any relevant project instructions, and build the prompt the model will actually see. What to keep, what to summarize, and what to drop is one of the hardest engineering problems in the whole loop.
2. Call the model: the model responds with reasoning and, usually, a request to do something: read a file, run a command, write a change.
3. Execute the action: the harness actually performs what the model asked. This is the step that separates an agent from a chatbot. How the harness designs its tools, and especially what it returns when things go wrong, shapes the model's ability to self-correct.
4. Inject the result: the output of that action becomes the new context on the next turn. The model sees what happened.
5. Decide: the model judges whether the task is done, whether to keep going, or whether something failed in a way that requires a different approach. It signals this by either requesting another tool call or returning a final response. The harness enforces structural limits (maximum turns, token budgets) but the core judgment about completion belongs to the model.
The interesting engineering is not concentrated in any single step. It is spread across the loop: deciding what context the model sees, designing tools that teach the model to recover from errors, enforcing guardrails without cutting the model off too early. Claude Code might run this loop hundreds of times on a single task (reading files, running tests, making edits, verifying the edits, fixing what the verification broke). Each individual turn is a small unit of work. The loop is what makes the whole thing feel like an agent rather than a very fast typist.
A chatbot runs steps 1 and 2 and stops. An agent keeps going.
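The five steps above can be sketched in a few lines. Everything here is illustrative scaffolding, not Claude Code's actual interfaces: the message shapes, the scripted stand-in model, and the tiny tool table are all assumptions made for the sketch.

```python
MAX_TURNS = 50  # structural limit enforced by the harness (step 5)

def run_agent(task, call_model, tools):
    context = [{"role": "user", "content": task}]   # step 1: assemble context
    for _ in range(MAX_TURNS):
        reply = call_model(context)                 # step 2: call the model
        if reply["type"] == "final":                # step 5: model signals done
            return reply["content"]
        tool = tools[reply["tool"]]                 # step 3: execute the action
        result = tool(*reply["args"])
        context.append({"role": "tool", "content": result})  # step 4: inject result
    return "stopped: turn budget exhausted"

# A scripted stand-in model: requests one tool call, then finishes.
def fake_model(context):
    if any(m["role"] == "tool" for m in context):
        return {"type": "final", "content": "done: " + context[-1]["content"]}
    return {"type": "tool_call", "tool": "echo", "args": ["hello"]}

print(run_agent("demo", fake_model, {"echo": lambda s: s}))  # → done: hello
```

The turn cap is the harness's structural guardrail; the decision to stop early (the `"final"` reply) belongs to the model, exactly as in step 5.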
The five things a harness actually does
When I started thinking seriously about harness design, I found it useful to decompose what a harness does into five distinct functions, because when something went wrong, I needed to know which part was responsible.
Tools are the verbs the model can use. Read, Edit, Write, Bash, WebFetch: the actions the harness exposes that let the model affect the world. Tool design turns out to matter as much as tool selection. A tool that returns a bare error code teaches the model nothing. A tool that returns a contextual error ("file not found at /src/auth.ts, did you mean /src/auth/index.ts?") teaches the model to self-correct. The ceiling of what an agent can accomplish is shaped, in large part, by the expressiveness of its tools and the quality of what they return when things go wrong.
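The error-shaping idea is small enough to sketch. This version uses Python's `difflib` to produce the "did you mean" hint; the in-memory file table and message format are illustrative, not Claude Code's actual implementation.

```python
import difflib

def read_file(path, files):
    """files: path -> contents, a stand-in for the real filesystem."""
    if path in files:
        return files[path]
    # Instead of a bare error code, return something the model can act on.
    close = difflib.get_close_matches(path, list(files), n=1, cutoff=0.4)
    hint = f", did you mean {close[0]}?" if close else ""
    return f"error: file not found at {path}{hint}"

files = {"/src/auth/index.ts": "export const login = () => {}"}
print(read_file("/src/auth.ts", files))
# → error: file not found at /src/auth.ts, did you mean /src/auth/index.ts?
```

The difference between the two error messages is the difference between a model that gives up and a model that retries with the right path on the next turn.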
Context management is what goes into the prompt on every turn. This sounds mechanical until you realize that context windows, even very large ones, are finite, and that long agent runs will fill them. I have watched functioning harnesses fall apart past turn thirty because the context pipeline was naive: just appending everything and hoping for the best. Good harnesses are aggressive about what they keep. They summarize completed subtasks, drop old tool results that are no longer relevant, pin the current plan where the model can always see it, and separate working memory from conversation history. Claude Code compacts automatically when the context fills, summarizing what it can afford to lose and re-injecting what it can't.
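One simple pruning policy looks like this: pin the plan, keep the most recent messages that fit the budget, and replace everything older with a summary marker. This is one reasonable choice for a sketch, not Claude Code's exact compaction algorithm, and the `size=len` budget is a crude stand-in for real token counting.

```python
def build_context(plan, history, budget, size=len):
    """Keep the plan pinned, keep the newest messages that fit, drop the rest."""
    kept, used = [], size(plan)
    for msg in reversed(history):            # walk newest-first
        if used + size(msg) > budget:
            break
        kept.append(msg)
        used += size(msg)
    dropped = len(history) - len(kept)
    # In a real harness the dropped span would be summarized by the model;
    # here a placeholder marks where that summary would go.
    summary = [f"[{dropped} earlier messages summarized]"] if dropped else []
    return [plan] + summary + list(reversed(kept))

print(build_context("PLAN", ["a" * 10, "b" * 10, "c" * 10], budget=25))
```

Even this toy version captures the key asymmetry: the plan always survives, while history is disposable in oldest-first order.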
Verification is how the harness knows a step actually worked. Running tests after an edit. Type-checking after a refactor. Re-reading a file to confirm the write landed as intended. This is the component that most naive harnesses skip, because the model is so confident about its outputs that verification feels unnecessary, right up until the moment the model confidently reports that it fixed the bug while the tests are still failing. Verification is what turns confident assertions into grounded facts.
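The discipline can be sketched as a single wrapper: apply the edit, then let an exit code, not the model's confidence, decide whether it worked. The `apply_edit` callback and command shape are illustrative; a real harness would run the project's own test suite.

```python
import subprocess
import sys

def edit_and_verify(apply_edit, test_cmd):
    apply_edit()                                      # make the change
    proc = subprocess.run(test_cmd, capture_output=True, text=True)
    # Failure output is returned so the harness can inject it into context;
    # the model reacts to evidence rather than asserting success unchecked.
    return {"ok": proc.returncode == 0,
            "evidence": proc.stdout + proc.stderr}

result = edit_and_verify(lambda: None,
                         [sys.executable, "-c", "assert 1 + 1 == 2"])
print(result["ok"])  # → True
```

The return value matters as much as the check itself: a bare pass/fail bit tells the model less than the actual test output does.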
Control flow is the decision logic around the loop. When to plan before acting. When to delegate a subtask to a separate agent rather than handling it inline. When to ask the user something rather than guessing. When to retry with a different strategy versus give up entirely. I think of this as the harness's executive function, the part that determines whether the model spends forty turns thrashing or eight turns solving. Most of it is implemented as instructions and hooks rather than explicit code. By instructions, I mean the system prompts and injected guidance that tell the model when to plan, when to delegate, and when to ask for help. By hooks, I mean the shell commands the harness runs automatically in response to events (before a file write, after a tool call, on error). This is part of why harness engineering feels more like system design than programming.
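The hook half of that can be sketched as a tiny event dispatcher. The event names and handler shape here are illustrative; in Claude Code the handlers are shell commands rather than Python callables.

```python
class Hooks:
    def __init__(self):
        self.handlers = {}                  # event name -> list of callables

    def on(self, event, fn):
        self.handlers.setdefault(event, []).append(fn)

    def fire(self, event, payload):
        # Any handler can veto the action by returning False, e.g. to
        # block a write to a protected path before it happens.
        return all(fn(payload) is not False
                   for fn in self.handlers.get(event, []))

hooks = Hooks()
hooks.on("before_write", lambda p: not p["path"].startswith("/etc"))
print(hooks.fire("before_write", {"path": "/etc/passwd"}))  # → False
print(hooks.fire("before_write", {"path": "/src/app.py"}))  # → True
```

The point of the veto semantics is that control flow lives outside the model: the model proposes the write, the hook disposes.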
Recovery is what happens when things break. Structured error handling, retry with fallback, subagent isolation to prevent a failed exploration from corrupting the main context, rollback when an edit made things worse than before. Recovery is the difference between a demo and a production harness. Demos show the happy path. Production harnesses live in the unhappy path and have to be designed for it.
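Retry-with-fallback, the simplest of those mechanisms, fits in a few lines. The strategy names are illustrative, and a production harness would add the isolation and rollback pieces this sketch omits.

```python
def recover(strategies, task):
    """Try strategies in order; record each failure rather than hiding it."""
    errors = []
    for name, attempt in strategies:
        try:
            return {"ok": True, "by": name, "result": attempt(task)}
        except Exception as e:
            errors.append((name, str(e)))   # remember why, then fall back
    return {"ok": False, "errors": errors}  # surface every failure upward

def direct_edit(task):
    raise RuntimeError("tests still failing")

print(recover([("direct_edit", direct_edit),
               ("revert_and_retry", lambda t: "patched")], "fix bug"))
# → {'ok': True, 'by': 'revert_and_retry', 'result': 'patched'}
```

Note that the failures are returned, not swallowed: when every strategy fails, the caller gets the full error history instead of a bare "it didn't work."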
What a harness is not
I want to draw a few contrasts, because the word "agent" has been applied to so many different things that it has become nearly meaningless, and part of what I want to do in this series is establish a shared vocabulary before we need it later.
A workflow is not a harness. A workflow is a fixed sequence of steps decided at design time: do A, then B, then C, then report back. The path is predetermined. A harness lets the model decide the path at runtime, based on what it finds as it goes. That flexibility, and the responsibility that comes with it, is the whole point.
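The contrast fits in a dozen lines. All names here are made up for the sketch: the workflow's path is fixed when it is written, while the harness asks a decision function (the model, in a real system) what to do next on every turn.

```python
def workflow(task, steps):
    for step in steps:                     # path decided at design time
        task = step(task)
    return task

def harness(task, decide, tools, max_turns=20):
    for _ in range(max_turns):
        action = decide(task)              # path decided at runtime
        if action == "done":
            return task
        task = tools[action](task)
    return task

upper = lambda s: s.upper()
print(workflow("fix it", [upper]))         # always runs `upper`, exactly once
print(harness("fix it",                    # runs tools until the decision
              lambda s: "done" if s.isupper() else "up",  # function says stop
              {"up": upper}))
```

Both calls print the same result here, which is exactly the trap: on the happy path a workflow and a harness look identical. They diverge the moment the task requires a path nobody predicted at design time.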
A RAG pipeline is not a harness. Retrieval-augmented generation retrieves relevant context and generates a one-shot response. It is passive and bounded. A harness is iterative and acts on the world. The outputs of a RAG pipeline are text. The outputs of a harness can be changed files, executed commands, modified databases.
A chatbot is not a harness, though a chatbot can become one. What separates them is the loop and the tools. Add tool-calling with real effects and a loop that runs until the job is done, and you have started building a harness. The distance between a chat interface and an agent is shorter than most people think, and also harder to close than most people expect.
Why this matters more than which model you use
The conversation around language models is almost entirely about the models themselves. Every few weeks there is a new release, a new benchmark, a new claim about which system is at the frontier. I understand why. Models are legible, benchmarks are publishable, and "we improved the model" is a clean story to tell.
What gets less attention is that the same model, inside a well-engineered harness, performs noticeably better than it does in a naive one. I have seen that gap exceed the difference between models a full generation apart. The harness is where the leverage is, and for most teams building agentic systems, improving the harness is a better investment than upgrading the model.
This series is about that claim: what the evidence for it looks like, what the components of a good harness actually are, and eventually, whether a thoroughly engineered harness around a capable open-weights model can match what the frontier labs are producing. The next post takes apart what production harnesses look like under the hood, using Claude Code as the primary case study (I have already written ten posts on its internals, so this felt like the natural continuation).
If you have been swapping models hoping for better results, there is a reasonable chance you have been pulling the wrong lever.
