The AI Factory Thesis

Playbook

March 26, 2026

The AI Factory Thesis

StrongDM's dark factory pattern points to a real shift in software work. Humans stop acting as line-by-line coders and diff reviewers. They write specs, design harnesses, shape the environment, and set control rules. Agents write the code, run the tools, fix failures, and keep going without line-by-line human review. That can work when the work is externally verifiable, the environment is instrumented, and the team treats validation as the main product. It breaks when teams skip the harness, let agents write their own exam, or use autonomy on work that lacks a clear outside view of success. (StrongDM)

StrongDM itself calls this a software factory. Dan Shapiro's label, dark factory, names the far end of the same idea: a black box that turns specs into software. Human labor shifts upward into spec writing, evaluation design, environment design, and intervention policy. StrongDM's public materials are company self-description, not an audited benchmark, but the underlying mechanics line up closely with what Anthropic, OpenAI, Cognition, and Sam Schillace have each described from their own systems. (StrongDM)

What StrongDM Means By Dark Factory

StrongDM's core formula is short: seed, validation harness, feedback loop. The seed can be a PRD, a spec, a few sentences, a screenshot, or an existing codebase. The harness must be end-to-end and close to the real environment. The feedback loop keeps sampling outputs and feeding them back until holdout scenarios pass and stay passing. Tokens are the fuel. (StrongDM)

The provocative rule set is what gets attention: code must not be written by humans, and code must not be reviewed by humans. StrongDM pairs that with a constant question, "Why am I doing this?" Dan Shapiro's gloss is useful here: if a human can say why something in a log or output looks wrong, that description can become a validation rule. The point is to force the team to convert judgment into repeatable checks. (StrongDM)

That last piece matters more than the slogan. A dark factory is not "AI does everything." It is a system where the human contribution moves out of the implementation loop and into the scaffolding around it. OpenAI describes the same move in its harness engineering work: humans stay in the loop, but at a different layer, prioritizing work, translating feedback into acceptance criteria, and validating outcomes. (OpenAI)

How The Pattern Works

1) Freeze The Target

StrongDM's seed looks a lot like OpenAI's Prompt.md and Cognition's Playbooks. Freeze goals, non-goals, constraints, deliverables, and "done when" checks before the agent starts. StrongDM says the seed can be tiny, but the strongest long-running systems turn vague intent into explicit artifacts. OpenAI's long-horizon Codex writeup uses Prompt.md to keep the agent from building something "impressive but wrong." Devin's Playbooks do the same thing for repeated tasks, with outcome, steps, postconditions, and forbidden actions. (StrongDM)

2) Hide The Exam

StrongDM learned fast that agent-written tests are not enough. Tests inside the repo can be rewritten to match the code, or the code can be bent to pass narrow checks. Their answer is scenarios: end-to-end user stories stored outside the codebase, like a holdout set. Simon Willison ties that back to Cem Kaner's scenario testing. Anthropic's eval guidance points in the same direction: start early, use 20 to 50 tasks drawn from real failures, and make success criteria unambiguous. The agent must not write its own exam. (StrongDM)

3) Grade Behavior, Not Source

StrongDM calls its grading measure satisfaction: of all observed trajectories through all scenarios, what fraction likely satisfies the user. That matters because agentic products often need more than a boolean unit test. Anthropic makes a similar point from the eval side. Agent behavior varies run to run, so teams need metrics that capture consistency as well as one-off success. StrongDM's own validation constraint makes the philosophy explicit: code is treated like an opaque model snapshot, and correctness is inferred from externally observable behavior. (StrongDM)

StrongDM also defines feedback much more broadly than CI signals. Its principles page explicitly lists traces, screen capture, conversation transcripts, incident replays, adversarial use, agentic simulation, just-in-time surveys, customer interviews, and price elasticity testing as inputs to the loop. That is a clue about the deeper shift. Once code generation is fast, product understanding, not typing, becomes the scarce resource. (StrongDM)

4) Make The World Cheap To Replay

StrongDM's sharpest move is the Digital Twin Universe. Instead of pointing agents at live Okta, Jira, Slack, Google Docs, Drive, and Sheets, they built behavioral clones of those systems and tested them at high volume. That lets them run thousands of scenarios per hour, hit edge cases that would be dangerous against production, and avoid rate limits and API costs. That is a real engineering tactic. (StrongDM)

5) Externalize The Memory

Long-running agent systems work best when memory lives on disk, in version control, and in structured state stores. StrongDM literally lists "The Filesystem" as a core technique and ships CXDB as a context store with a turn DAG, deduplication, typed views, and branching from any turn. OpenAI's harness team keeps a short AGENTS.md as a table of contents and treats a structured docs/ directory as the system of record. Anthropic's long-running harness uses a feature list JSON, git history, init scripts, and progress files so each fresh session can get its bearings. Durable artifacts beat asking the next context window to remember what the last one meant. (StrongDM)

StrongDM's "Pyramid Summaries" idea matters here too. It wants summaries that compress context without losing the ability to expand back to full detail. That lines up with OpenAI's compaction guidance and Anthropic's handoff-oriented harnesses. Good memory systems summarize to keep work moving, but keep a reversible trail when the agent needs to drill back into specifics. Lossy summaries create slow, hard-to-debug amnesia. (StrongDM)

6) Run In Phases, Not One Endless Chat

StrongDM's "Shift Work" idea, its Attractor graph runner, Anthropic's initializer-plus-coding-agent split, and OpenAI's long-horizon loop all converge on the same lesson: autonomy has to survive handoffs. Attractor checkpoints after each node and can resume after a crash. Anthropic asks one agent to set up the environment, then later agents to make one feature's worth of progress and leave clear artifacts. OpenAI describes the durable loop as plan, edit, run tools, observe, repair, update docs, repeat. Long runs behave more like relay shifts than uninterrupted cognition. (StrongDM)

7) Constrain The Worker

The best teams do not hand one giant agent every tool and every permission. StrongDM's coding-agent spec emphasizes provider-aligned toolsets and real-time steering. Anthropic lets teams scope MCP servers to a subagent, preload only the skills it needs, and give it persistent memory at the right scope. OpenAI says to add tools only when they remove a real manual loop and to maximize a single agent before reaching for multi-agent orchestration. Tool sprawl makes agents slower, dumber, and harder to debug. (GitHub)

8) Transplant Proven Patterns

Some of StrongDM's side techniques are more important than they look. "Gene Transfusion" means showing the agent a working exemplar so it can transplant a pattern into a new repo. "Semport" means porting across languages or frameworks while preserving intent. Those ideas explain why some teams get real reuse out of agents. The fastest path is often not invention. It is carrying a proven pattern into a new setting under clear constraints. (StrongDM)

Why It Works

The pattern works best in software because software can often be judged externally. Anthropic's eval guide says coding agents fit deterministic graders unusually well: does the code run and do the tests pass in a stable environment? Its autonomy-in-practice study also found software engineering accounts for nearly 50% of agentic activity on Anthropic's public API. That likely reflects the same thing StrongDM is exploiting: code is one of the few high-value domains where outputs can be exercised, measured, and rolled back with relatively clean feedback loops. (Anthropic)

It also works when the architecture has real seams. Schillace says the compounding teams he has seen are small, run 5 to 10 processes in parallel, and care deeply about modular boundaries. Anthropic says multi-agent systems shine when tasks have independent branches, but many coding tasks do not. Cognition goes further and argues that default multi-agent splits are fragile because agents carry implicit decisions that clash when they cannot see the same context. Strong autonomy likes isolated pieces, shared context, and narrow contracts. It does poorly in a bowl of spaghetti. (Sunday Letters)

The economics changed too. StrongDM openly says tokens are fuel and floats a deliberately aggressive heuristic of $1,000 a day in tokens per human engineer. Schillace reports teams already spending hundreds per day, with one aiming for $1,000. Anthropic's research system found agents use about 4x the tokens of chat, and multi-agent systems about 15x. The good teams accept that bill because the harness converts extra tokens into more validated work. Bad teams burn the same tokens inside loose loops and call it progress. (StrongDM)

One more subtle point: the strongest factories are usually model portfolios, not one model with a crown. StrongDM's March 12, 2026 weather report says it uses gpt-5.4 for planning and architectural critique, gpt-5.3-codex as the default implementation model, consensus(opus-4.6, gpt-5.4) for sprint planning, and Opus 4.6 for QA orchestration and DevOps tasks. Planning, coding, critique, and QA are different jobs. The systems that respect that usually look better than the ones asking one model to be every specialist at once. (StrongDM)

Where It Breaks

Most failures start with a weak harness. StrongDM ran into reward hacking almost immediately. Anthropic warns that without evals teams get stuck in reactive loops, fixing one production failure and creating another. Long-running harnesses also fail when agents mark features complete too early, which is why Anthropic stores feature requirements in JSON and only flips a pass field after careful testing. If the agent can redefine success, it will. (StrongDM)

Teams also reach for multi-agent architectures too early because they look advanced. Anthropic says to start with the simplest workable system and maximize a single agent first. OpenAI says much the same. Cognition is blunter: share context and full traces, because subagents acting on partial context make conflicting implicit decisions. Anthropic's own research team saw early failures like spawning 50 subagents for simple queries and searching forever for sources that did not exist. Parallelism helps only when the work really branches. (Anthropic)

Long-running autonomy degrades when critical decisions live in Slack threads, tribal memory, or giant prompts. OpenAI learned that one big AGENTS.md failed, so it turned the file into a map and moved real knowledge into docs/. Anthropic's harness persists progress in files and git. StrongDM treats the filesystem itself as memory and offers CXDB for structured traces. Good systems keep the durable context where the agent can read it. The rest becomes invisible work. (OpenAI)

Agents are still much better at well-scoped work than moving targets. Cognition says Devin excels when requirements are clear up front and verifiable, roughly the sort of task that would take a junior engineer 4 to 8 hours. It performs worse when the human keeps changing requirements mid-task. Anthropic's long-running harness reached the same conclusion from another angle: make one feature's worth of progress at a time, commit it, write a progress update, then continue. Long runs need scope control as much as model quality. (Cognition)

Per-action approval feels safe, but it often produces the opposite. Anthropic found sandboxing reduced permission prompts by 84% and argued that constant clicking creates approval fatigue. Its autonomy-in-practice study found experienced users supervise by monitoring and interrupting when needed, not by approving every step. OpenAI's governance guide lands in the same place: use risk-proportionate controls and human intervention for high-risk actions, not blanket friction everywhere. Good oversight is selective, legible, and tied to actual risk. (Anthropic)

Autonomous systems also fail for banal reasons: context truncation, flaky tools, broken dev servers, crashed processes, stale state. Anthropic says durable long-running work remains an open problem and answers it with initializer agents, progress files, and clean-state expectations. Attractor checkpoints after every node. OpenAI's long-horizon guidance emphasizes session controls like resume, fork, and compact. If the system cannot recover from an interruption, it is not autonomous in any serious sense. (Anthropic)

The dark-factory pattern belongs first in work where failures are easy to catch and easy to undo. OpenAI's governance guide recommends escalating controls by risk tier, with human-in-the-loop and isolated environments for high-risk cases. Anthropic's usage study says most agent actions today are low-risk and reversible, even as experimentation spreads into finance, healthcare, and cybersecurity. The harder the failure is to detect or unwind, the less you should rely on pure autonomy. (OpenAI Developers)

How Other Teams Frame The Same Idea

Across companies, the same structure keeps showing up under different names. Anthropic talks about simple patterns first, strong evals early, and long-running harnesses built around feature lists, progress files, and clean handoffs. OpenAI talks about harness engineering, agent-legible repos, custom linters and "taste invariants," durable project memory, skills, and automations. Cognition talks about Playbooks, Knowledge, verification mechanisms, and the discipline of well-scoped tasks. Schillace describes small compounding teams building their own frameworks around models, then watching human attention become the bottleneck. The names vary. The stack does not. (Anthropic)

The biggest disagreement is where to place the default. StrongDM and Shapiro are arguing from the frontier, where zero hand-written code and zero traditional review are design goals. Anthropic and OpenAI give more conservative advice: start simple, keep one agent as long as you can, add structure only when the task demands it. Cognition agrees on ambition but is skeptical of multi-agent theater. That tension is healthy. It keeps teams from confusing the far frontier with the right first step. (StrongDM)

For most teams, turning off human code review on day 1 is a category error. StrongDM can even attempt it because it invested heavily in scenario sets, twins, context infrastructure, and orchestration. OpenAI's own Codex guidance still recommends asking Codex to review its changes, run checks, and confirm behavior, and at OpenAI Codex reviews 100% of PRs. The right way to read StrongDM is as an end state for teams with unusually strong harnesses, not as permission to skip rigor. (StrongDM)

A Practical Playbook

Most teams do not need StrongDM's full Digital Twin Universe on day 1. They need a disciplined workshop that can grow into more autonomy. A sane path looks like this. (Anthropic)

1) Pick One Narrow Workflow

Start with a narrow, high-volume task that already has a clear answer and high repetition: dependency upgrades, vulnerability fixes, migrations, test backfills, log triage, release notes. Cognition says Devin shines on clear 4 to 8 hour junior-level tasks with verifiable outcomes. OpenAI recommends agents where rule-based automation falls short, not everywhere at once. (Cognition)

2) Freeze The Target In A Spec File

Write one spec file before the run. Include goals, non-goals, hard constraints, deliverables, and done-when checks. OpenAI's Prompt.md pattern is a good template, and StrongDM's seed concept is the same idea in shorter form. (OpenAI Developers)

3) Build A Hidden Acceptance Set

Keep a protected set of end-to-end scenarios outside the editable code path, or at least in a location the agent cannot casually rewrite. If you cannot do that yet, use Anthropic's feature-list JSON pattern and only let the agent flip pass/fail status after end-to-end verification. The goal is to stop the agent from grading itself. (StrongDM)

4) Keep Repo Memory Layered

Use a short root guide as a map, not an encyclopedia. Put durable design decisions in docs/, keep a progress log, commit often, and let the next session read the log before it acts. OpenAI, Anthropic, and StrongDM all converged on this layered memory pattern from different directions. (OpenAI)

5) Turn Repeated Prompts Into Skills

If you keep correcting the same workflow, package it. OpenAI says recurring work should become a skill once the prompt repeats. Anthropic's skills live in SKILL.md and support extra files so the main instructions stay short. For production use, write the skill description like routing logic: when to use it, when not to use it, what inputs it expects, what outputs it should produce, and what edge cases should block it. Negative examples matter. (OpenAI Developers)

6) Constrain Execution

Run the agent in an isolated worktree or container. Scope tools and network access tightly. Anthropic's sandboxing case is practical: filesystem plus network isolation cut permission prompts while raising safety. OpenAI says add tools only when they remove a real manual loop. (Anthropic)

7) Parallelize Only Along Clean Seams

Use background agents or subagents for independent units, not for tangled shared-state work. Anthropic's /batch-style and research patterns, OpenAI's worktrees and subagents, and Cognition's separate sessions all assume that each unit has a clean boundary. If 2 agents need the same evolving context, keep it single-threaded until you can split the interface cleanly. (Claude API Docs)

8) Automate Maintenance Last

Once the workflow is stable, schedule it. OpenAI draws a useful line: skills define the method, automations define the schedule. Its harness team also runs background Codex tasks that scan for deviations and open refactoring PRs. That is the maintenance version of a factory: not just building features, but continuously cleaning drift before it spreads. (OpenAI Developers)

What Separates The Good From The Bad

The good teams treat the harness as the product. They invest in external validation, repository legibility, clean modular seams, explicit skills, and durable state. They use human judgment to design the system and to set escalation points. The weak teams spend their time polishing prompts, adding more agents, and staring at diffs that should have become checks. StrongDM's own techniques list is revealed here: digital twins, concrete examples, filesystem memory, shift work, semantic porting, pyramid summaries. Almost all of it is scaffolding. Very little is clever prose in a prompt box. (StrongDM)

A useful test is whether your system answers 4 questions cleanly. What is the target? What is the hidden acceptance bar? What persists across runs? What actions need containment or escalation? If those answers are fuzzy, more autonomy usually makes the problem worse. If those answers are crisp, autonomy starts compounding. (StrongDM)

Judgment

StrongDM's dark-factory pattern matters because it flips the bottleneck. The scarce thing is no longer code writing. It is specification quality, harness quality, simulation quality, and system design. That is why the strongest public examples look less like magical copilots and more like carefully engineered factories around models. (StrongDM)

Start with a well-instrumented workshop. Graduate to dark-factory behavior only after you can define success from outside the code, replay that success cheaply, and stop or redirect the agent safely at clear boundaries. Build validation first. Autonomy gets much better once it has something honest to obey. (Anthropic)