Your skills.md has an expiration date. Here's what to do about it.
TL;DR
The leaked Claude Code source is the wrong layer to obsess over — Jo Van Eyck argues the real leverage is not Anthropic’s ~50K-line wrapper but the outer “user harness” you control: skills, agent.md files, MCP servers, scripts, tests, and rules.
Birgitta Böckeler’s “harness engineering” frame gives software engineers a cleaner mental model — the key split is feedforward mechanisms (what you load into the agent loop up front) versus feedback mechanisms (tests, static analysis, logs, code review agents) that let the model self-correct.
The missing dimension is time — Jo’s core addition to Böckeler’s framework is that a great harness today may be stale tomorrow, because model improvements can make once-essential scaffolding unnecessary.
Evals are how you tell whether harness changes actually help — instead of guessing, he suggests running repeatable experiments, adding or removing skills and MCP servers, and measuring whether success rates improve as models change.
Small teams can get started with “vibes-based” experimentation; larger orgs should build eval infrastructure — Jo says solo devs don’t need heavy systems, but companies taking agents seriously should treat harness tuning like ML experimentation, potentially even with nightly autonomous optimization jobs.
A concrete example: he deleted a pre-commit hook that stripped model-written comments — a year ago models ignored his “no comments” rule and needed cleanup scripts; now Opus follows the instruction reliably enough that the workaround is gone.
The Breakdown
Stop gawking at the Claude Code leak
Jo opens with a shrug: Claude Code’s leaked source is “a wrapper around a model,” so the drama is overblown. His point is that software engineers are staring at the wrong abstraction layer — not the leaked internals, but the layer above them where users actually create leverage.
The onion model: LLM in the center, harness on the outside
Borrowing from Birgitta Böckeler’s Thoughtworks article, he walks through the “onion”: the LLM at the core, provider-built agent harnesses like Claude Code and Codex CLI in the middle, and the user-controlled harness on the outside. He underlines how little magic there is in the middle layer — Claude Code is only about 50K lines, and he built a lightweight version himself in 200 lines last summer.
Feedforward and feedback: the two ways you shape an agent
Jo likes Böckeler’s terminology because it maps neatly to when humans intervene in the loop. Feedforward is everything that improves output before the agent gets to work — skill files, agent.md, MCP servers, language servers, scripts, tools, and type systems — while feedback is what lets the agent catch and fix its own mistakes through unit tests, static analysis, logs, and review agents.
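The feedback half of that split can be sketched as a small loop: run the checks, and if they fail, feed the output back into the model's next prompt. This is a minimal illustration of the pattern, not Jo's or Böckeler's actual code — `ask_model` and `run_checks` are hypothetical stand-ins for whatever agent call and test runner your harness uses.

```python
def agent_loop(task, ask_model, run_checks, max_rounds=3):
    """Feedback mechanism: let the agent see its own failures.

    ask_model(prompt)  -- hypothetical call that has the model edit the repo
    run_checks()       -- hypothetical harness check (tests, linters, logs);
                          returns (passed: bool, output: str)
    """
    prompt = task
    for _ in range(max_rounds):
        ask_model(prompt)
        passed, output = run_checks()
        if passed:
            return True
        # The failure output becomes feedforward for the next round.
        prompt = f"{task}\n\nThe harness checks failed:\n{output}\nFix the code."
    return False
```

The design point is that the checks are part of the harness, not the provider's wrapper: swapping `run_checks` from a pytest run to a type-checker or a review agent changes the feedback without touching the loop.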
Inferential vs. computational, with the human doing the steering
He then adds the second axis: inferential versus computational. Computational is classic CPU work; inferential is LLM-backed reasoning. Together they describe the machinery surrounding the coding agent, while the human steers, either synchronously in-session or asynchronously by refining the harness.
The framework is good, but it forgets the arrow of time
Jo’s main critique is that the model is too static: it captures your harness today, not how it evolves. He compares this to the “bitter lesson” — we build clever compensations for current model weaknesses, but after a few model iterations, some of those compensations may simply disappear as the models get better.
Build for today’s models, but expect tomorrow’s
He pushes back a bit on Boris Cherny’s advice to build for tomorrow’s models, noting that most developers still have to ship with Sonnet and Opus today. His version is more grounded: build for the models you have now, but assume your harness will need to change as future models arrive.
Evals are the missing method for tuning the harness
To decide whether adding or removing a skill or MCP server helps, Jo says the answer is straightforward: evals. His example is simple — run repeated tasks with the full harness, then pull pieces out and measure whether success rates go up or down — which lets you see both what works now and what becomes obsolete after the next model drop.
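The ablation idea can be made concrete with a few lines of pseudo-harness code. This is a sketch of the method as described, not Jo's eval infrastructure — `run_with` is a hypothetical function that executes the task once under a given harness configuration and reports success.

```python
def success_rate(run_task, trials=20):
    """Repeat one eval task and measure how often the agent succeeds.

    run_task() -- hypothetical callable returning 1 on success, 0 on failure.
    """
    return sum(run_task() for _ in range(trials)) / trials

def ablate(harness, run_with, trials=20):
    """Score the full harness, then remove one component at a time.

    harness  -- dict of harness components (skills, MCP servers, hooks, ...)
    run_with -- hypothetical function: run_with(config) -> 1 or 0
    """
    scores = {"full": success_rate(lambda: run_with(harness), trials)}
    for name in harness:
        reduced = {k: v for k, v in harness.items() if k != name}
        scores[f"without {name}"] = success_rate(lambda: run_with(reduced), trials)
    return scores
```

Re-running the same ablation after a model upgrade is how you spot expired scaffolding: a component whose removal no longer hurts the success rate is a workaround the new model made obsolete.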
One year later, the workaround disappeared
His favorite concrete example is a pre-commit hook he used to strip code comments because models kept “vomiting” comments into generated code even after explicit instructions not to. A year later, that hack is gone from his harness because Opus now obeys the rule reliably if he manages the context window well — exactly the kind of expiration date he wants people to notice.
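For flavor, a hook like the one he describes could be as simple as a script that drops full-line comments before commit. This is a hypothetical reconstruction, not Jo's actual hook, and it is deliberately naive: it would also strip shebangs and miss inline comments.

```python
def strip_full_line_comments(source):
    """Drop lines that are only a # comment, the kind models used to
    'vomit' into generated Python despite a 'no comments' instruction."""
    kept = [line for line in source.splitlines()
            if not line.lstrip().startswith("#")]
    return "\n".join(kept)
```

The point of the example is not the script itself but its deletion: once Opus followed the "no comments" rule reliably, this whole compensation layer could be removed from the harness.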