AI Self-Evolution (Meta Harness)
TL;DR
Meta Harness treats the harness as the real leverage point — Matthew Berman’s core claim is that the code around a model like Claude or GPT matters as much as the weights themselves, and the paper shows changing only the harness can create a 6x performance gap on the same benchmark.
The big idea is self-improving harnesses, not just self-improving models — building on Andrej Karpathy’s Auto Research and Google’s AlphaEvolve, Meta Harness repeatedly proposes, tests, logs, and rewrites its own agent framework so the system around the model gets better without a human manually tuning it.
The system works because it retrieves experience instead of cramming everything into context — rather than compressing millions of tokens of prior code, traces, prompts, and failures into one prompt, the coding-agent proposer uses tools like a developer would, inspecting only the files and artifacts it needs.
On text classification, Meta Harness beat prior methods while using far fewer tokens — it posted the best average score at 48 versus 40.9 for ACE, crushed the law benchmark with 45 versus 29, and used just 11.4 context units compared with 28.5 for ACE and 50.8 for MCE.
The gains generalized beyond the training tasks and even helped IMO-style math reasoning — after optimizing on three tasks, the discovered harness transferred to nine unseen datasets and still led on average at 73.1 versus 70.2 for ACE; on IMO-style math it improved scores by an average of 4.7 points on held-out models, through retrieval of reusable proof patterns.
On Terminal Bench 2, evolved harnesses were competitive with or better than hand-built ones — with Claude Opus 4.6, Meta Harness reached 76.4, higher than every listed handwritten system except Forge Code, and with Claude Haiku 4.5 it led outright at 37.6 versus Goose at 35.5.
The Breakdown
The harness is the thing everyone underrates
Berman opens with a strong thesis: all software will become self-evolving software, and the catalyst is the harness — the code wrapped around models like Claude, GPT, or Gemini that gives them memory, retrieval, tool use, and long-running autonomy. His framing is simple and sticky: the model weights are just the engine; the harness is the rest of the car that actually gets you somewhere.
From Karpathy’s Auto Research to harnesses improving themselves
He connects the paper to Andrej Karpathy’s Auto Research project, which already has 61,000 stars and lets a model run experiments overnight to improve how it trains a GPT-2-level model. His point is that we’ve already seen AI improve pieces of software, and now the same recursive loop is being aimed at the harness itself — the layer he thinks is the missing ingredient for AGI-like performance today.
Why old prompt optimization breaks on real harnesses
Berman slows down here to explain why this isn’t just better prompt engineering. Harnesses operate over long horizons, with memory, retrieval decisions, tool calls, state updates, and delayed downstream effects, so reducing all that to a single scalar score or a tiny summarized prompt loses the signal you actually need to improve the system.
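To make that concrete, here is a minimal sketch of the kind of structured trace a harness run produces, versus the single number old optimizers reduce it to. The names (`HarnessTrace`, `ToolCall`) and fields are illustrative assumptions, not structures from the paper:

```python
# Hypothetical sketch: why a single scalar score loses the signal needed to
# improve a long-horizon harness. HarnessTrace/ToolCall are invented names.
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    tool: str   # e.g. "grep", "run_tests"
    args: str
    ok: bool    # did this step succeed?

@dataclass
class HarnessTrace:
    task_id: str
    tool_calls: list[ToolCall] = field(default_factory=list)
    memory_writes: list[str] = field(default_factory=list)
    final_score: float = 0.0

    def failure_steps(self) -> list[ToolCall]:
        # Per-step failures are exactly what a scalar score throws away:
        # a proposer needs to see *where* a run went wrong to patch it.
        return [c for c in self.tool_calls if not c.ok]

trace = HarnessTrace(
    task_id="law-42",
    tool_calls=[ToolCall("grep", "statute", True),
                ToolCall("run_tests", "-k retrieval", False)],
    memory_writes=["cached statute index"],
    final_score=0.29,
)
print(len(trace.failure_steps()))  # -> 1: the failing step survives in the trace
```

Collapsing this run to `0.29` tells an optimizer nothing about the failed `run_tests` step or the memory write that preceded it, which is the signal the paper argues a proposer actually needs.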
Let the model decide what context it needs
One of the paper's key ideas, and one Berman clearly loves, is adaptive retrieval: don't pack everything into one monolithic prompt up front. Give the proposer access to the codebase and prior runs, then let it inspect what matters, the same way Cursor or Claude Code explores files with grep and cat instead of pretending a million-token codebase can fit cleanly into one context window.
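A toy sketch of that workflow, with an in-memory "repo" and grep/cat-style tool stubs standing in for the real agent tooling (the file names, contents, and tool functions are all assumptions for illustration):

```python
# Hedged sketch of adaptive retrieval: instead of concatenating every prior
# artifact into one prompt, expose grep/cat-style tools and let the proposer
# pull only what it asks for. The "repo" contents are invented.
repo = {
    "harness/retrieval.py": "def retrieve(query): ...\n# TODO: rank by recency",
    "harness/memory.py":    "class Memory: ...",
    "logs/run_017.txt":     "FAIL: retrieval returned stale entries",
}

def tool_grep(pattern: str) -> list[str]:
    """Return paths whose contents mention the pattern (like `grep -l`)."""
    return [path for path, text in repo.items() if pattern in text]

def tool_cat(path: str) -> str:
    """Return one file's contents (like `cat`)."""
    return repo[path]

# The proposer chases a failure signal instead of reading everything:
hits = tool_grep("stale")                 # find which artifact mentions the bug
context = {p: tool_cat(p) for p in hits}  # load only those files into context
print(sorted(context))                    # -> ['logs/run_017.txt']
```

Only the one relevant log lands in context; the rest of the "codebase" never costs a token, which is the efficiency story behind the 11.4-versus-50.8 context numbers above.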
The proposer is itself a coding agent
Meta Harness uses a coding-agent proposer that can modify code, inspect prior harnesses, and decide whether to make a small patch or a larger rewrite. Berman emphasizes how “meta” this is: a harness around a model is now improving another harness, and because the outer loop is minimal, the whole thing gets better automatically as coding agents themselves improve.
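The outer loop Berman describes can be sketched in a few lines. Here `propose_patches` and `evaluate` are deliberately trivial stand-ins (a real proposer is a coding agent editing files, and a real evaluator is a benchmark run); only the propose-test-log-keep structure reflects the idea:

```python
# Minimal sketch of the self-improvement loop: propose harness edits, score
# them, log every attempt, keep only improvements. The toy "harness" is a
# single knob and the scoring function is invented (it rewards retrieval_k=5).
def propose_patches(harness: dict) -> list[dict]:
    """Stand-in for the coding-agent proposer: try small edits both ways."""
    return [{"retrieval_k": harness["retrieval_k"] + d} for d in (-1, 1)]

def evaluate(harness: dict) -> float:
    """Stand-in benchmark: pretend retrieval_k=5 is optimal."""
    return 1.0 - abs(harness["retrieval_k"] - 5) / 10

best = {"retrieval_k": 1}
best_score = evaluate(best)
history = []  # logged experience a future proposer could retrieve

for step in range(10):
    for candidate in propose_patches(best):
        score = evaluate(candidate)
        history.append((candidate, score))  # log failures too, not just wins
        if score > best_score:              # keep the harness only if it improves
            best, best_score = candidate, score

print(best["retrieval_k"], best_score)  # -> 5 1.0
```

The "meta" part is that in the paper the proposer is itself a harnessed model, so swapping in a stronger coding agent upgrades the loop with no other changes.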
The text classification results are the first real wow moment
On text classification, Meta Harness beat the field on average with a score of 48 versus 40.9 for ACE, and on the law benchmark it jumped to 45 while the next-best method topped out at 29. What really gets Berman excited is efficiency: Meta Harness used 11.4 context units versus 28.5 for ACE and 50.8 for MCE, so it was not just better but cheaper.
It wasn’t just overfitting — it transferred and helped math too
The authors tested whether the discovered harness only memorized the initial tasks, and it didn't: across nine unseen datasets, it still led on average at 73.1 compared with 70.2 for ACE. On IMO-style mathematical reasoning, Berman highlights the paper's elegant explanation for why retrieval helps: hard proofs often share reusable patterns, so pulling in prior solutions can genuinely improve reasoning on a new problem.
Terminal Bench 2 and the “bitter lesson” ending
The finale is Terminal Bench 2, where Meta Harness with Claude Opus 4.6 scored 76.4, beating every benchmarked handwritten harness except Forge Code, and with Claude Haiku 4.5 it led outright at 37.6 versus Goose at 35.5. Berman ties it to the bitter lesson: hand-written heuristics eventually lose to end-to-end learned systems, so if AI can evolve its own harness today, his broader takeaway is that eventually all code, automations, and software loops will be self-improving too.