
Harness Engineering: Anthropic's Practical Guide to Building Production AI Agents

Apr 08, 2026 · 14 mins read

The AI engineering landscape has undergone three distinct evolutions in just four years: from crafting perfect prompts in 2022, to assembling context in 2025, to today's era of Harness Engineering—where the infrastructure surrounding AI models matters more than the models themselves.

The Three Generations of AI Engineering

2022-2024: Prompt Engineering The focus was on writing the perfect instruction. Techniques like few-shot learning, chain-of-thought, and role-playing were all about optimizing single inputs.

2025: Context Engineering Single prompts weren’t enough. Engineers learned to dynamically construct entire contexts—relevant files, conversation history, tool definitions, knowledge base results—so models could make informed decisions.

2026: Harness Engineering Context engineering is now table stakes. Harness Engineering operates at a higher level: constraints, feedback loops, architectural rules, tool chains, and lifecycle management. It’s about creating an environment where agents can work continuously, stably, and at high quality.

The Evidence: Same Model, Different Shell

The data is compelling:

  • Nate B Jones: Same model, different harness → success rate jumped from 42% to 78%
  • Anthropic: Same prompt, same model → 20 minutes/$9 produced broken core functionality, while 6 hours/$200 delivered a playable game
  • LangChain: Terminal Bench 2.0 score rose from 52.8% to 66.5% by changing only the harness
  • Terminal Bench 2.0: Claude Opus 4.6 ranked #33 with one harness, #5 with another—same model, 28-position difference

The conclusion is clear: optimizing the shell around models may yield higher returns than waiting for the next generation of models.

Anthropic’s Evolution: What to Stop Doing

Anthropic’s second blog post on Harness Engineering answers a critical question: “What can I stop doing?”

As models improve, your harness needs to evolve. Here’s what Anthropic learned:

1. Use Tools Claude Already Knows

When Claude 3.5 Sonnet achieved 49% on SWE-bench Verified, it used just two tools: bash and text editor. Not custom agent tools—generic developer tools that every version of Claude gets better at using.

The insight: Don’t build new tools for Claude. Let it compose solutions from tools it already knows.

Agent Skills, Programmatic Tool Calling, Memory Tools—all emerged from combining bash and text editor in novel ways.

2. Let Models Orchestrate Themselves

Traditional agent harnesses assume every tool call result must return to the model’s context window for the next decision. But this wastes tokens.

The old way: Read a large table to analyze one column → all rows consume tokens.

The new way: Give Claude a code execution tool (bash or REPL). Let it write code to call tools, filter results, and chain logic. Only final output returns to context.

This shifts orchestration power from harness to model. And because code is a universal orchestration language, strong coding models naturally become strong general agents.
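A minimal sketch of this pattern, using invented tool names and a toy sandbox (not any vendor's actual API): the harness hands the model a code runner, the model's generated code calls tools and filters locally, and only the final string re-enters the context window.

```python
# Sketch of programmatic tool calling (hypothetical tool names, toy sandbox).
# The full table never enters the model's context; only the result does.

def read_table() -> list[dict]:
    # Stand-in for a real data tool; imagine thousands of rows.
    return [{"region": "EU", "revenue": i} for i in range(10_000)]

def run_model_code(code: str, tools: dict) -> str:
    # The harness executes model-written code with tools in scope.
    scope = dict(tools)
    exec(code, scope)
    return scope["result"]  # only this string re-enters the context window

# Code the model might emit: filtering happens inside the sandbox.
model_code = """
rows = read_table()
total = sum(r["revenue"] for r in rows if r["region"] == "EU")
result = f"EU revenue total: {total}"
"""

summary = run_model_code(model_code, {"read_table": read_table})
print(summary)
```

In the old pattern, all 10,000 rows would have been serialized into the context window just so the model could add one column.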

Real impact: On BrowseComp (web search agent benchmark), giving Opus 4.6 self-filtering capability increased accuracy from 45.3% to 61.6%—on a non-programming task.

3. Let Models Manage Their Own Context

Traditional approach: Pre-load all task instructions into system prompt. Problem: More instructions = tighter attention budget, and most instructions aren’t relevant most of the time.

The solution: Skills

Each skill’s YAML frontmatter provides a brief description loaded into context for overview. Full content loads via read file tool only when needed.

This gives Claude freedom to assemble its own context window.
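The loading pattern can be sketched in a few lines. This is an illustrative stand-in, not Anthropic's implementation—the skill names, file contents, and parsing are all invented:

```python
# Sketch of skill-style lazy loading: only each skill's one-line description
# is loaded up front; the full body is read on demand via a file tool.

SKILLS = {
    "pdf-forms.md": "---\ndescription: Fill and flatten PDF forms\n---\nStep 1: ...",
    "brand-style.md": "---\ndescription: Apply company brand guidelines\n---\nUse palette ...",
}

def frontmatter_description(text: str) -> str:
    # Minimal frontmatter parse: grab the `description:` line between --- markers.
    header = text.split("---")[1]
    for line in header.splitlines():
        if line.startswith("description:"):
            return line.removeprefix("description:").strip()
    return ""

def skill_index() -> dict[str, str]:
    # What the overview context gets: name -> one-line description only.
    return {name: frontmatter_description(body) for name, body in SKILLS.items()}

def load_skill(name: str) -> str:
    # Called via a read-file tool only when the model decides it needs it.
    return SKILLS[name].split("---", 2)[2].strip()

print(skill_index())
print(load_skill("pdf-forms.md"))
```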

Context editing is the inverse: selectively delete outdated context (old tool results, thinking blocks).
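A toy version of that deletion, assuming a simplified message shape (the `tool_result` role and pruning policy here are illustrative, not a real API):

```python
# Sketch of context editing: drop stale tool results, keeping only the most
# recent N, while leaving user/assistant text untouched.

def prune_tool_results(messages: list[dict], keep_last: int = 2) -> list[dict]:
    tool_idxs = [i for i, m in enumerate(messages) if m["role"] == "tool_result"]
    stale = set(tool_idxs[:-keep_last]) if keep_last else set(tool_idxs)
    return [m for i, m in enumerate(messages) if i not in stale]

history = [
    {"role": "user", "content": "Refactor the auth module"},
    {"role": "tool_result", "content": "<5,000 lines of grep output>"},
    {"role": "assistant", "content": "Found the entry points."},
    {"role": "tool_result", "content": "<file contents>"},
    {"role": "tool_result", "content": "<test output>"},
]

pruned = prune_tool_results(history, keep_last=2)
print([m["role"] for m in pruned])
```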

Subagents let Claude know when to fork a clean context window and isolate subtasks. Opus 4.6’s subagent capability improved BrowseComp by 2.8% over best single-agent runs.

4. Let Models Manage Their Own Memory

Long tasks exceed single context windows. The traditional approach builds retrieval infrastructure around models.

Anthropic took a different path: give Claude simple ways to choose what to save.

Compaction lets Claude summarize historical context to maintain continuity. But the same compaction mechanism produced vastly different results across models:

  • Sonnet 4.5: stuck at 43%
  • Opus 4.5: reached 68%
  • Opus 4.6: achieved 84%

This proves models themselves know what to remember and what to forget.
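The mechanism itself is simple enough to sketch—a toy string summarizer stands in for the model call that would write the summary in a real harness:

```python
# Sketch of compaction: when the transcript exceeds a token budget, older
# turns collapse into one summary message and the recent tail stays verbatim.

def rough_tokens(messages: list[str]) -> int:
    # Crude heuristic: ~4 characters per token.
    return sum(len(m) for m in messages) // 4

def compact(messages: list[str], budget: int, keep_tail: int = 2) -> list[str]:
    if rough_tokens(messages) <= budget:
        return messages
    head, tail = messages[:-keep_tail], messages[-keep_tail:]
    # In a real harness the model writes this summary itself, which is
    # exactly where the Sonnet 4.5 / Opus 4.6 gap comes from.
    summary = f"[summary of {len(head)} earlier messages]"
    return [summary] + tail

transcript = ["step " + "x" * 100 for _ in range(10)]
compacted = compact(transcript, budget=50)
print(len(compacted), compacted[0])
```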

Memory folders offer another approach: give Claude a read-write folder and let it decide what to persist. This improved Sonnet 4.5’s BrowseComp-Plus score from 60.4% to 67.2%.
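The tool surface for a memory folder can be this small (the function names and file contents below are illustrative assumptions, not Anthropic's actual tools):

```python
# Sketch of a memory folder: the harness exposes plain read/write/list tools
# over one directory and lets the model decide what to persist.

import json
import tempfile
from pathlib import Path

MEMORY_DIR = Path(tempfile.mkdtemp()) / "memory"
MEMORY_DIR.mkdir()

def memory_write(name: str, content: str) -> str:
    (MEMORY_DIR / name).write_text(content)
    return f"saved {name}"

def memory_list() -> list[str]:
    return sorted(p.name for p in MEMORY_DIR.iterdir())

def memory_read(name: str) -> str:
    return (MEMORY_DIR / name).read_text()

# The model, mid-task, chooses to record a hard-won lesson:
memory_write("learnings.md", "Electric attacks fail against ground types.")
memory_write("progress.json", json.dumps({"badges": 3}))

print(memory_list())
print(memory_read("learnings.md"))
```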

In a Pokémon game test:

  • Sonnet 3.5 (14,000 steps): Still in second town, 31 files including duplicate caterpillar notes
  • Opus 4.6 (same steps): 10 files organized in directories, three gym badges, plus a “learnings” file extracted from failures

The evolution: from “recording NPC dialogue” to “documenting battle losses.” Same mechanism, smarter model.

What Still Needs Human Engineering

“Doing less” doesn’t mean doing nothing. Anthropic shared extensive guidance on when to keep harness constraints:

Cache Design

The Messages API is stateless—every conversation round requires sending full history. Cached tokens cost only 10% of base input, so maximizing cache hit rate directly impacts costs.

Five principles:

  1. Put dynamic content at the end of prompts
  2. Append new messages instead of treating each as single-turn
  3. Don’t switch models mid-conversation (breaks cache)
  4. Carefully manage tools (adding/removing invalidates cache)
  5. For multi-turn agents, move breakpoints to latest message
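Principles 2 and 5 together look roughly like this. The message shape loosely follows Anthropic's `cache_control` blocks, but this is an illustrative sketch of breakpoint bookkeeping, not a working API call:

```python
# Sketch of cache breakpoint management: each turn, clear the previous
# breakpoint and mark the newest message, so the stable prefix stays cacheable.

def move_breakpoint(messages: list[dict]) -> list[dict]:
    for m in messages:
        m.pop("cache_control", None)          # clear last turn's marker
    messages[-1]["cache_control"] = {"type": "ephemeral"}
    return messages

history = [
    {"role": "user", "content": "Fix the flaky test"},
    {"role": "assistant", "content": "Patched the fixture."},
]
history = move_breakpoint(history)

# Next turn: append (principle 2) rather than rebuilding the transcript,
# then move the breakpoint again (principle 5).
history.append({"role": "user", "content": "Now update the docs"})
history = move_breakpoint(history)

print([("cache_control" in m) for m in history])
```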

Declarative Tools for Boundaries

If all Claude operations go through bash, every action looks identical to the harness—just a command string. Deleting a file and calling an external API have the same “shape.”

But these operations have vastly different risk profiles.

Anthropic’s recommendation: Extract actions requiring safety controls, user interaction, or audit trails from bash into independent tools.

Example: Claude Code’s edit tool is independent, not a bash command. This lets the harness check for file staleness before editing, preventing overwrites. If edits used bash sed, the harness wouldn’t know which files changed.
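A staleness check of that kind might look like the following. This is a sketch, not Claude Code's actual implementation: the harness records each file's mtime at read time and refuses an edit if the file changed since—something a bash `sed` could never surface.

```python
# Sketch of a declarative edit tool with staleness detection (illustrative,
# not Claude Code's real edit tool).

import os
import tempfile

read_mtimes: dict[str, float] = {}

def read_file(path: str) -> str:
    read_mtimes[path] = os.path.getmtime(path)
    with open(path) as f:
        return f.read()

def edit_file(path: str, old: str, new: str) -> str:
    if os.path.getmtime(path) != read_mtimes.get(path):
        return "error: file changed since last read; re-read before editing"
    with open(path) as f:
        text = f.read()
    with open(path, "w") as f:
        f.write(text.replace(old, new, 1))
    read_mtimes[path] = os.path.getmtime(path)  # our own write is not "stale"
    return "ok"

fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "w") as f:
    f.write("retries = 1\n")

read_file(path)
print(edit_file(path, "retries = 1", "retries = 3"))
```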

Similarly, operations needing user confirmation (such as external API calls) become tools that can trigger confirmation dialogs, and information meant for the user (such as clarifying questions) becomes tools that render as UI components.

Decision criterion: Reversibility. The harder an operation is to undo, the more it deserves to be an independent tool.

Claude Code’s auto-mode offers an alternative: use a second Claude instance to review bash commands for safety. This reduces the need for independent tools but only suits scenarios where users trust the overall direction.

This decision itself requires continuous re-evaluation.

The Reality: The Shell Keeps Evolving

Is Noam Brown right when he says “Harness is like a crutch we’ll eventually transcend”? He’s half right.

Anthropic’s own experiments illustrate both halves. Their early harness with Opus 4.5 required breaking work into sprints with context resets between each, because 4.5 had “context anxiety”—it would prematurely wrap up when approaching context limits.

Then Opus 4.6 arrived with better planning, long-task stability, long-context retrieval, and self-debugging. Anthropic immediately eliminated the sprint structure, letting the generator run entire builds continuously—over two hours without crashing.

The sprint “crutch” was discarded.

But the evaluator wasn’t.

Even as Opus 4.6 got stronger, its capability boundary just moved outward—it didn’t disappear. Within that boundary, the generator works independently and the evaluator becomes overhead. But near the boundary, the evaluator still catches issues the generator misses—surface-only implementations, subtle API routing bugs, missing interaction logic.

Anthropic’s conclusion: The possibility space of harness doesn’t shrink with model progress. It shifts.

Models get stronger, old constraints can be removed, but new higher-order constraint spaces open up. You used to teach agents context window management—now you don’t. But you can now let them do 4-hour autonomous development tasks, requiring new feedback and acceptance mechanisms.

It’s not just Anthropic. Manus restructured their harness 5 times in 6 months. LangChain re-architected their research agent 3 times in a year. Vercel cut 80% of their agent tools.

Harness isn’t one-time engineering. It’s a continuously evolving system.

Production Validation: OpenAI’s Million-Line Experiment

OpenAI’s Codex team provided the industry’s reality check on “the engineer’s role.”

Starting from an empty git repository, 5 months, approximately 1 million lines of code, 1,500 PRs—all generated by agents. Humans wrote zero lines of code.

The team started with 3 engineers, expanded to 7. Using GPT-5-powered Codex CLI, they built a complete production-grade application from scratch. Average 3.5 PRs merged per engineer per day. Traditional manual coding would have taken 10x longer.

Core engineer Ryan Lopopolo’s summary: “Agents aren’t hard. Harness is hard.”

Five hard rules from 5 months of practice:

  1. The repository is the agent’s only knowledge source
  2. Code must be readable to agents, not just humans
  3. Architecture constraints use linters, not prompts
  4. Autonomy gets granted incrementally
  5. If a PR needs major changes to merge, the problem isn’t the agent—it’s the harness

OpenAI’s self-assessment: “Our biggest challenge now is designing environments, feedback loops, and control systems.”

Not writing code. Writing rules.

The Cursor team discovered a counter-intuitive insight: constraining the solution space actually makes agents more productive.

When models can generate anything, they waste tokens exploring dead ends. When harness defines clear boundaries, agents converge to correct answers faster.

Not Just OpenAI

If only OpenAI were doing this, Harness Engineering might be hype. But multiple companies arrived at the same conclusion simultaneously.

Stripe’s “Minions” merges 1,300+ PRs weekly, all from autonomous agents.

Their architecture has a noteworthy design: Blueprint orchestration splits workflows into deterministic nodes and agentic nodes.

  • Deterministic nodes (run linters, push changes) execute fixed paths without calling LLMs
  • Agentic nodes (implement features, fix CI) let models decide

Stripe enforces a hard rule: CI runs maximum twice. First failure → agent auto-fixes and reruns. Second failure → escalates to humans. No infinite agent retry loops.
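The bounded-retry rule is easy to state as control flow. A sketch with hypothetical function names (Stripe hasn't published their Blueprint code):

```python
# Sketch of Stripe-style bounded CI retries: at most two CI runs, one
# automatic agent fix in between, then escalation to a human.

def run_with_ci(implement, run_ci, fix, escalate) -> str:
    implement()
    if run_ci():
        return "merged"
    fix()                      # agentic node: agent reads CI logs, patches
    if run_ci():               # second and final CI run
        return "merged"
    escalate()                 # deterministic node: no infinite retry loop
    return "escalated"

# Toy run: CI fails once, the agent's fix makes it pass.
ci_results = iter([False, True])
outcome = run_with_ci(
    implement=lambda: None,
    run_ci=lambda: next(ci_results),
    fix=lambda: None,
    escalate=lambda: None,
)
print(outcome)
```

Note that the loop itself is a deterministic node: the model decides *how* to fix CI, but never *whether* to try a third time.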

Their tool platform hosts ~500 MCP tools, but each agent gets a carefully curated subset. They discovered: more tools ≠ better performance.

Stripe engineering’s conclusion: Success depends on reliable developer environments, test infrastructure, and feedback loops—not model choice.

Cursor’s “Self-Driving Codebases” went further—~1,000 commits/hour, 10+ million tool calls/week, zero human intervention after launch.

But their journey through dead ends proves harness design difficulty:

  • V1: Single agent—couldn’t handle complex tasks
  • V2: Multi-agent with shared state files—severe lock contention, agents fighting each other
  • V3: Structured role division—too rigid
  • V4: Continuous executor—role overload
  • Final: Recursive Planner-Worker model

They discovered a darkly humorous phenomenon: a vague initial instruction gets amplified across hundreds of concurrent agents. One agent’s mistake × hundreds of concurrent runs = predictable disaster.

The Core Problem: Models Don’t Self-Evaluate

All these examples address one problem: how to make agents continuously produce high-quality code. But Anthropic’s blog revealed a deeper issue.

Models don’t evaluate their own work.

Anthropic engineers found that when you ask an agent to assess what it just wrote, it confidently declares it’s great—even when humans clearly see quality issues. This problem is especially severe in subjective tasks (frontend design, page aesthetics) with no binary right/wrong standards. But even in tasks with objective criteria, agent self-judgment fails.

This is a foundational reason harness must exist. Models have sufficient capability but lack accurate self-awareness of that capability.

Anthropic’s solution borrowed from GANs (Generative Adversarial Networks):

Split generation and evaluation into two independent agents. A generator writes, an evaluator judges. The evaluator doesn’t score screenshots—it uses Playwright to actually click pages, query APIs, check database state like a real QA, then provides feedback.

But evaluators aren’t inherently reliable. Out-of-the-box Claude is a poor QA agent. Early evaluators would find issues, then convince themselves they weren’t big problems and approve anyway. They also tended toward surface testing without exploring edge cases.

Multiple rounds of calibration were needed to align evaluator strictness with human expectations.

The key insight: Making an independent evaluator strict is far easier than teaching a generator self-criticism. That’s the value of separation.
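The control flow of that separation can be sketched with toy stand-ins for both agents (the real evaluator drives Playwright against a live page; here trivial functions play both roles):

```python
# Sketch of a generator/evaluator loop: the generator never grades itself;
# a separate evaluator gates each round, and its feedback feeds the next attempt.

def build_loop(generate, evaluate, max_rounds: int = 5):
    feedback = ""
    for round_no in range(1, max_rounds + 1):
        artifact = generate(feedback)
        verdict, feedback = evaluate(artifact)   # real version clicks pages, queries APIs
        if verdict == "approve":
            return artifact, round_no
    return artifact, max_rounds

# Toy agents: the generator "fixes" whatever the evaluator last flagged.
def generate(feedback: str) -> str:
    return "app with save button" if "save" in feedback else "app"

def evaluate(artifact: str):
    if "save button" in artifact:
        return "approve", ""
    return "reject", "save button missing"

artifact, rounds = build_loop(generate, evaluate)
print(artifact, rounds)
```

The calibration work Anthropic describes lives entirely inside `evaluate`: tuning how strict the verdict is, independent of the generator.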

In full-stack development tests, they used a 3-agent architecture:

  • Planner expands one-sentence requirements into complete product specs
  • Generator implements features by sprint
  • Evaluator does real acceptance testing after each sprint

Solo mode output looked fine but core functionality was broken—UI completely non-responsive.

Harness mode produced a more complete editor, playable game, rough but functional physics engine with working core interactions.

The evaluator’s bug reports pinpointed exactly which line of code had what problem.

Survey Data

I used deep research to survey the harness gains that companies have published:

  • LangChain: Same model (gpt-5.2-codex), only changed harness → Terminal Bench 2.0 score from 52.8% to 66.5%, ranking jumped from outside top 30 to top 5
  • Nate B Jones: Same model, different harness → success rate 42% vs 78%, shell change equivalent to one model generation improvement
  • Anthropic: Solo mode 20min/$9 (broken core functionality), harness mode 6hr/$200 (functional). Later optimized harness: ~4hr/$125 for complete build
  • Terminal Bench 2.0: Claude Opus 4.6 ranked #33 with Claude Code Harness, #5 with different harness—same model, 28-position gap
  • Pi Research: One afternoon, only modified harness, improved 15 different LLM coding abilities. Paper title: “Improving 15 LLMs at Coding in One Afternoon. Only the Harness Changed”

All point to one conclusion: at this inflection point, optimizing the shell around models may yield higher returns than waiting for next-gen models.

The Counterargument

Not everyone agrees.

OpenAI’s Noam Brown said directly in a Latent Space interview:

“Harness is like a crutch. We will eventually transcend it.”

His reasoning: Before reasoning models appeared, developers built massive agentic systems on GPT-4o to simulate reasoning—routers, orchestrators, multi-agent collaboration. Then reasoning models launched, and overnight these systems became unnecessary. In many scenarios, they actually made results worse.

His prediction: OpenAI is moving toward a unified model future. You shouldn’t need routers on top of models. His advice to developers:

“Don’t spend six months building something that might be obsolete in six months.”

METR data also dealt a blow to the Harness camp.

They recruited 4 active maintainers from scikit-learn, Sphinx, and pytest projects to review 296 AI-generated PRs. Result: maintainer merge rate was only about half the automated scoring pass rate.

Automated scoring claimed agents could independently complete ~50-minute tasks, but maintainers actually merged PRs corresponding to only ~8-minute task ranges—7x capability overestimation.

Latent Space named this faction the Bitter Lesson camp, after Rich Sutton’s famous AI essay: don’t invest too heavily in engineering tricks—compute growth will eventually steamroll everything.

Final Thoughts

OpenAI’s Codex team no longer writes code—they write architectural rules, linter configs, and AGENTS.md. Stripe engineers no longer write code—they write Blueprint orchestration and CI rate-limiting policies. Anthropic engineers no longer write code—they write evaluator scoring standards and calibration logic.

Writing code is becoming cheap. Designing systems that let agents write code continuously, stably, and at high quality is the truly expensive part.

And that system itself isn’t one-and-done. Each model generation requires re-examining which constraints still work, which should be removed, and which new spaces opened up.

The truly scarce capability isn’t inside models—it’s outside them. And it needs rewriting every few months.

Anthropic’s blog post ends with this line:

“The interesting space of harness combinations doesn’t shrink as models improve. It just moves. And what AI engineers really need to do is continuously find the next effective combination.”

This is Harness Engineering in 2026: not a static architecture, but a living, evolving practice that dances with model capabilities. The question isn’t whether to build harnesses—it’s how quickly you can adapt them to the models you have today.


References:

  • Anthropic Blog: “Harness Engineering: Stop Waiting for Next-Gen Models, Build Today”
  • Anthropic Blog: “What Can I Stop Doing? Three Patterns for Modern Harness Design”
  • Source articles (Chinese): 探索AGI WeChat blog