An AI state of the union: We’ve passed the inflection point & dark factories are coming
TL;DR
Coding crossed a real threshold in late 2025 — Simon Willison argues GPT-5.1 and Claude Opus 4.5 were only incrementally better on paper, but in practice they pushed coding agents from "mostly works if you babysit it" to "usually does what you asked," which is why engineers came back from the holidays realizing they could generate 10,000 lines of code a day.
The big shift isn’t vibe coding — it’s “agentic engineering” — Willison draws a hard line between casual prototyping for yourself and professional software work, saying real production use now depends on orchestrating agents, tests, reviews, and judgment rather than just prompting and hoping.
Dark factories are the next frontier, and StrongDM is already prototyping them — He highlights StrongDM’s “nobody writes code, nobody reads code” experiments, including swarms of AI QA testers and simulated versions of Slack, Jira, and Okta, spending roughly $10,000/day on tokens to stress-test security software around the clock.
AI makes senior engineers more powerful, but may squeeze the middle — Willison says his 25 years of experience are now an amplifier for agents, while Thoughtworks’ recent offsite suggested beginners benefit from AI-assisted onboarding and seniors benefit from leverage, leaving mid-career engineers as the most exposed group.
The bottleneck has moved from writing code to proving what should exist — Because prototypes are now nearly free, Willison says product work shifts toward trying multiple directions fast and then doing old-school human usability testing, since AI can generate ideas and UI mocks but still can’t credibly stand in for real users.
He thinks AI is drifting toward a “Challenger disaster” moment on security — Drawing on Diane Vaughan’s “normalization of deviance,” Willison warns that prompt injection remains fundamentally unsolved, and the industry keeps getting more comfortable deploying unsafe agent patterns simply because a catastrophic headline-grabbing failure hasn’t happened yet.
The Breakdown
The inflection point where coding agents got actually useful
Willison opens with the claim that 2025 was the year Anthropic and OpenAI realized “code is the application.” Claude Code, reasoning models, and the late-year jump to GPT-5.1 and Claude Opus 4.5 created a threshold change: agents stopped being buggy novelty tools and started returning software that usually works if you specify it well. That’s why engineers came back in January and February with the same stunned reaction: this stuff suddenly works now.
From vibe coding to real software engineering with agents
He loves vibe coding in Andrej Karpathy’s original sense — making prototypes without caring about the code — especially for personal tools and demos. But he’s adamant that this isn’t the same as professional work, where bugs can hurt other people and experience still matters enormously. His preferred term is “agentic engineering,” because the interesting discipline now is learning how experts use agents to build software that’s not just faster, but actually better.
Dark factories: nobody writes code, nobody reads code
The most sci-fi section is the “dark factory” idea, borrowed from automated factories where you can turn the lights off because no humans are on the floor. Willison points to StrongDM, which experimented with rules like “nobody types code” and then “nobody reads code,” replacing manual review with agent swarms that simulate QA. The wildest example: thousands of simulated employees in fake Slack channels asking for Jira or Okta access, with the company reportedly spending around $10,000 a day on tokens to test security workflows nonstop.
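The simulated-employee idea can be sketched in miniature: many fake users file access requests against a stubbed approval system so the workflow gets exercised continuously. Everything here (the resource names, the `approve` policy, the `simulate` helper) is invented for illustration; StrongDM's actual setup involves agent swarms and full fake Slack/Jira/Okta environments, not a loop like this.

```python
# Toy sketch of "simulated employees" stress-testing an access workflow.
# All names and the approve() logic are hypothetical stand-ins.
import random

REQUESTABLE = ["jira", "okta", "github", "prod-db"]

def approve(user: str, resource: str) -> bool:
    # Stand-in policy: auto-approve everything except production resources.
    return resource != "prod-db"

def simulate(num_users: int, seed: int = 0) -> dict:
    # Each simulated user requests one random resource; tally the outcomes.
    rng = random.Random(seed)
    results = {"approved": 0, "denied": 0}
    for i in range(num_users):
        resource = rng.choice(REQUESTABLE)
        key = "approved" if approve(f"user{i}", resource) else "denied"
        results[key] += 1
    return results

print(simulate(1000))
```

The point of running this around the clock, as described, is that the denials and approvals themselves become a continuous test suite for the security product sitting in the middle.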
Security gets weirder as agents become credible attackers
He says the code-writing boom is now spilling into security, where models have become good enough to act like real penetration testers. Anthropic and OpenAI even have restricted-access security models that won't be released publicly, and he cites Anthropic helping Mozilla identify around 100 potential Firefox vulnerabilities. The flip side is grimly familiar: maintainers now get polished-looking AI-generated vulnerability reports from people who haven't verified anything, which wastes everyone's time.
The bottlenecks moved — and humans are still in the loop
Because code is cheap now, the hard part is no longer implementation; it’s figuring out what to build and how to evaluate it. Willison says he often prototypes three versions of a feature because it’s so fast, but insists the real test is still human usability testing on Zoom, not simulated users clicking around. He compares AI brainstorming to the first two-thirds of a whiteboard session: great for exhausting obvious ideas, and occasionally useful for creating weird sparks when you force strange combinations like “SaaS marketing inspired by marine biology.”
Why the best AI users look exhausted instead of relaxed
One of the most human parts of the conversation is Willison admitting that he can run four agents in parallel and be mentally wiped out by 11 a.m. He says using agents well takes “every inch” of his 25 years of engineering experience, and that many early adopters are losing sleep because it feels like their agents could always be doing more. At the same time, he’s clearly having a blast: he raised his own ambition this year, cleared out side-project backlogs, and describes the whole moment as both exhilarating and vaguely addictive.
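The "four agents in parallel" workflow looks roughly like fanning out long-running tasks and then paying the real cost, human review, when they all come back. A minimal sketch with `asyncio`, where the fake `run_agent` coroutine (names and timings invented) stands in for a real coding-agent session:

```python
# Toy sketch of running several agent sessions concurrently.
# run_agent is a hypothetical stand-in for a real agent invocation.
import asyncio

async def run_agent(task: str, seconds: float) -> str:
    await asyncio.sleep(seconds)  # stand-in for a long agent run
    return f"{task}: done"

async def main() -> list[str]:
    tasks = [
        run_agent("refactor-auth", 0.03),
        run_agent("write-docs", 0.01),
        run_agent("fix-flaky-test", 0.02),
        run_agent("spike-new-ui", 0.04),
    ]
    # All four run concurrently; gather preserves submission order.
    return await asyncio.gather(*tasks)

results = asyncio.run(main())
```

The exhaustion Willison describes lives outside this loop: the machine parallelizes cheaply, but every completed task still lands on one person's review queue.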
The new rules of agentic engineering
In the practical section, Willison shares the patterns he thinks matter most: tests are non-negotiable, red/green TDD works especially well with agents, and a good project template can steer an agent better than pages of instructions. He also talks about “hoarding” solved problems — storing tools, prototypes, and research repos in GitHub so agents can recombine them later. The throughline is simple: if code is cheap, the edge moves to taste, scaffolding, verification, and having a deep backlog of things you know work.
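The red/green loop he recommends is the classic one: write a failing test first (red), then let the agent iterate on the implementation until it passes (green). A minimal sketch using plain asserts; the `slugify` function and its spec are hypothetical examples, not from the talk:

```python
# Minimal red/green TDD sketch. slugify and its spec are invented examples.
import re

# RED: the test is written first; against an empty stub it fails,
# which gives the agent an unambiguous target.
def test_slugify():
    assert slugify("Hello, World!") == "hello-world"
    assert slugify("  Agentic  Engineering ") == "agentic-engineering"

# GREEN: the implementation the agent iterates on until the test passes.
def slugify(text: str) -> str:
    # Lowercase, collapse runs of non-alphanumerics into single hyphens,
    # then trim hyphens from both ends.
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

test_slugify()
```

The design point is that the test, not the prompt, carries the specification: the agent can rewrite the implementation freely, and the human only has to trust the assertions.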
Prompt injection, the lethal trifecta, and the coming AI disaster
Willison closes on the risk he’s most worried about: prompt injection, the class of attacks he named in 2022, where malicious text hijacks an LLM-powered system. His sharper framing is the “lethal trifecta”: an agent has access to private data, can ingest attacker-controlled instructions, and has a channel to exfiltrate information. He compares the current industry mood to the Challenger shuttle disaster — everyone knows the O-rings are bad, but repeated near-misses make institutions feel safer than they are — and predicts AI is heading toward that same kind of reckoning.
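The trifecta is easy to state as a predicate: an agent is in the danger zone only when all three capabilities are present at once. A hypothetical policy check (the capability names and `AgentConfig` type are invented for illustration, not a real API):

```python
# Illustrative check for Willison's "lethal trifecta". The field names
# and AgentConfig type are hypothetical, not from any real framework.
from dataclasses import dataclass

@dataclass
class AgentConfig:
    reads_private_data: bool       # e.g. email, internal docs
    ingests_untrusted_input: bool  # e.g. web pages, inbound messages
    can_exfiltrate: bool           # e.g. outbound HTTP, sending email

def has_lethal_trifecta(agent: AgentConfig) -> bool:
    # Any two of the three may be tolerable; all three together let
    # injected instructions read private data and smuggle it out.
    return (agent.reads_private_data
            and agent.ingests_untrusted_input
            and agent.can_exfiltrate)

browser_agent = AgentConfig(True, True, True)
assert has_lethal_trifecta(browser_agent)  # block or sandbox this one
```

The uncomfortable part of his argument is that removing any one leg closes the attack, yet popular agent designs keep shipping with all three, which is exactly the normalization of deviance he warns about.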