Alea

Jeffrey Emanuel’s Agent Flywheel Playbook

April 3, 2026

Agent Flywheel is a planning-first method for turning a fuzzy software goal into a coordinated swarm of coding agents. Its core idea is not “use more agents.” It is “make the repo legible to agents through durable artifacts.”

Jeffrey Emanuel’s Agent Flywheel is an operating protocol for multi-agent coding. The project state lives outside chat, in a large markdown plan, a graph of dependency-linked “beads” tasks, an AGENTS.md operating manual, and an email-like coordination layer. The goal is to let several agents work in parallel without improvising architecture from a narrow local slice of the codebase. (Agent Flywheel)

Emanuel’s public CASS Memory project is the flagship example: a 5,500-line plan became 347 beads, 11,000 lines of code, 204 commits, and a working system in about 5 hours with 25 agents. It is, however, fair to say the evidence base is still thin. Most hard numbers come from Emanuel’s own guide and tool pages rather than independent benchmarks.

Treat Flywheel as a software delivery method, not as proof that coding agents have solved enterprise-grade software engineering. The strongest ideas here are portable. The most aggressive defaults, especially direct commits to main in one shared workspace, will be too hot for many teams.

The Core Move

Flywheel’s central insight is that reasoning gets cheaper as you move up the artifact stack. In Emanuel’s framing, plan space is where the whole system still fits in context, bead space packages that plan into executable memory, and code space is local implementation and verification. The argument is to catch system mistakes in plan space, where fixes are cheap, before they harden into dependencies or code churn.

Jeffrey Emanuel (@doodlestein):

Before you burn up a lot of tokens with a big agent swarm on a new project, the old woodworking maxim of "Measure twice, cut once!" is worth revising as "Check your beads N times, implement once," where N is basically as many as you can stomach. I've found that you continue to get more and more improvements, even if they're subtle, the more times you run this in a row with Opus 4.5 (note that the following prompt is only for use AFTER you've already turned your initial markdown plan into beads using the other prompt I gave recently in my recent very long post about my workflows): "Reread AGENTS dot md so it's still fresh in your mind. Check over each bead super carefully-- are you sure it makes sense? Is it optimal? Could we change anything to make the system work better for users? If so, revise the beads. It's a lot easier and faster to operate in "plan space" before we start implementing these things! DO NOT OVERSIMPLIFY THINGS! DO NOT LOSE ANY FEATURES OR FUNCTIONALITY! Also, make sure that as part of these beads, we include comprehensive unit tests and e2e test scripts with great, detailed logging so we can be sure that everything is working perfectly after implementation. Remember to ONLY use the `bd` tool to create and modify the beads and to add the dependencies to beads. Use ultrathink." I used to only run that once or twice before starting implementation, but I experimented recently with running it 6+ times, and it kept making useful refinements. If it starts to flatline in terms of incremental improvements to the beads, you might try starting a brand new CC session, starting it with: "First read ALL of the AGENTS dot md file and README dot md file super carefully and understand ALL of both! Then use your code investigation agent mode to fully understand the code, and technical architecture and purpose of the project. Use ultrathink." 
And then following up with the same prompt as shown above, but prefaced with: "We recently transformed a markdown plan file into a bunch of new beads. I want you to very carefully review and analyze these using `bd` and `bv`." The more complex and intricate your markdown plan is, the more relevant this technique is. If you have a small, trivial plan and a very simple project, this is obviously overkill. But in that case, you will likely see little in the way of incremental gains/changes with each round, so it should be fairly obvious when it's time to stop. Just remember: planning tokens are a lot fewer and cheaper than implementation tokens. Even a very big, complex markdown plan is shorter than a few substantive code files, let alone a whole project. And the models are far smarter when reasoning about a plan that is very detailed and fleshed out but still trivially small enough to easily fit within their context window (this is really the key insight behind my obsessive focus on planning and why I spent 80%+ of my time on that part). And if you lean on GPT Pro with Extended Reasoning in the web app for the initial planning as I strongly advocate (that is, to create and improve your markdown plan that you eventually turn into beads), you basically get those on an all-you-can-eat basis with a Pro plan, so take full advantage of that! No other model can touch Pro on the web when it's dealing with input that easily fits into its context window. It's truly unique. Now, you can still get a lot of extra mileage by blending in smart ideas from Gemini3 in the web app with Deep Think enabled, or from Grok4 Heavy, or Opus 4.5 in the web app, but you still want to use GPT Pro on the web as the final arbiter of what to take from which model and how to best integrate it. 
And since this post could still be even more comically long, I'll leave you with my prompt for integrating those competing plans into one single canonical "best of all worlds" markdown plan: "I asked 3 competing LLMs to do the exact same thing and they came up with pretty different plans which you can read below. I want you to REALLY carefully analyze their plans with an open mind and be intellectually honest about what they did that's better than your plan. Then I want you to come up with the best possible revisions to your plan (you should simply update your existing document for your original plan with the revisions) that artfully and skillfully blends the "best of all worlds" to create a true, ultimate, superior hybrid version of the plan that best achieves our stated goals and will work the best in real-world practice to solve the problems we are facing and our overarching goals while ensuring the extreme success of the enterprise as best as possible; you should provide me with a complete series of git-diff style changes to your original plan to turn it into the new, enhanced, much longer and detailed plan that integrates the best of all the plans with every good idea included (you don't need to mention which ideas came from which models in the final revised enhanced plan):" (Hell, one more prompt for kicks; I use this one to iteratively improve an existing markdown plan): "Carefully review this entire plan for me and come up with your best revisions in terms of better architecture, new features, changed features, etc. to make it better, more robust/reliable, more performant, more compelling/useful, etc. For each proposed change, give me your detailed analysis and rationale/justification for why it would make the project better along with the git-diff style change versus the original plan shown below:"

Jan 6

The planning phase is unusually aggressive. The guide recommends asking several frontier models for competing plans, synthesizing them into one hybrid, then running fresh-round revision passes until the remaining suggestions shrink to edge cases and cleanup. The named models will change. The durable idea is the protocol: get diverse proposals, force explicit rationales, and settle architecture before the swarm burns implementation tokens.

Then the method turns prose into beads. A bead is a self-contained work unit with context, dependencies, completion criteria, and test obligations. Good beads are rich enough that a fresh agent can execute them without reopening the full plan. The guide says complex projects can generate 200 to 500 beads, which tells you how much of the method’s effort sits in translation, not coding.
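To make the shape of a bead concrete, here is a minimal sketch in Python. The field names and the `is_ready` helper are illustrative assumptions, not the real schema of the `bd` tool, which the guide does not publish here.

```python
from dataclasses import dataclass, field

@dataclass
class Bead:
    """Illustrative work unit; field names are assumptions, not the real `bd` schema."""
    id: str
    title: str
    context: str                                         # enough background for a fresh agent
    depends_on: list[str] = field(default_factory=list)
    completion_criteria: list[str] = field(default_factory=list)
    test_obligations: list[str] = field(default_factory=list)

    def is_ready(self, done: set[str]) -> bool:
        # A bead is executable once every dependency has landed.
        return all(d in done for d in self.depends_on)

# Hypothetical bead from an SSO feature decomposition:
sso = Bead(
    id="sso-07",
    title="OIDC callback and session logic",
    context="Exchange the auth code, validate ID token claims, mint a session.",
    depends_on=["sso-03"],
    completion_criteria=["callback handles expired codes"],
    test_obligations=["e2e: full login round trip with detailed logging"],
)
print(sso.is_ready(done={"sso-03"}))  # True
```

The point of the richness is the `context` and `test_obligations` fields: a fresh agent should be able to execute the bead without reopening the full plan.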

The “core” consists of beads for task structure, bv for graph-aware routing, and Agent Mail for coordination. (Agent Flywheel)

The Breakthroughs

One innovation is that Flywheel treats issue tracking as a graph problem, not a list problem. bv uses signals like PageRank, betweenness centrality, and critical path analysis to surface the task that unlocks the most downstream work. That is a meaningful upgrade over letting agents grab whatever happens to be closest in scrollback. (MCP Agent Mail)
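The flavor of graph-aware triage can be sketched in a few lines. bv's real scoring formula is not documented in the guide; the stand-in below uses only the critical-path idea, ranking ready tasks by the longest chain of downstream work they sit in front of. The task names are invented.

```python
from functools import lru_cache

# Toy dependency map: deps[task] lists the tasks that must land first.
deps = {
    "schema": [],
    "callback": ["schema"],
    "audit": ["schema"],
    "ui": ["callback", "audit"],
    "e2e": ["ui"],
}

# Invert the map: unlocks[task] = tasks directly waiting on it.
unlocks: dict[str, list[str]] = {t: [] for t in deps}
for task, ds in deps.items():
    for d in ds:
        unlocks[d].append(task)

@lru_cache(maxsize=None)
def downstream_depth(task: str) -> int:
    # Longest chain of work behind this task: a stand-in for bv's
    # critical-path signal (its actual blend of PageRank, betweenness,
    # and path analysis is not public).
    return 1 + max((downstream_depth(t) for t in unlocks[task]), default=0)

done: set[str] = set()
ready = [t for t in deps if t not in done and all(d in done for d in deps[t])]
print(max(ready, key=downstream_depth))  # -> schema
```

Even this crude version beats scrollback proximity: it routes the swarm to "schema" first because everything else transitively waits on it.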

Another is Agent Mail’s coordination model. It gives agents identities, threaded inboxes, targeted rather than default-broadcast messaging, advisory file reservations with TTL expiry, searchable history, and optional pre-commit guards. Multi-agent coding usually fails less from lack of model capability than from collisions, stranded context, and silent divergence. Agent Mail is built to attack those failure modes directly. (MCP Agent Mail)
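The advisory-reservation idea is simple enough to sketch. This is a minimal in-process toy in the spirit of Agent Mail's file reservations, not its actual API: one holder per path, TTL expiry, and no hard enforcement (a misbehaving agent can still write, which is why the real system pairs reservations with optional pre-commit guards).

```python
import time

class FileReservations:
    """Advisory file reservations with TTL expiry; an illustrative sketch,
    not Agent Mail's real interface."""

    def __init__(self) -> None:
        self._held: dict[str, tuple[str, float]] = {}  # path -> (agent, expiry)

    def reserve(self, path: str, agent: str, ttl_s: float = 900) -> bool:
        holder = self._held.get(path)
        if holder and holder[1] > time.monotonic() and holder[0] != agent:
            return False                       # someone else holds a live reservation
        self._held[path] = (agent, time.monotonic() + ttl_s)
        return True

    def release(self, path: str, agent: str) -> None:
        if self._held.get(path, ("", 0.0))[0] == agent:
            del self._held[path]

r = FileReservations()
assert r.reserve("src/auth.py", "agent-a")
assert not r.reserve("src/auth.py", "agent-b")  # advisory conflict
r.release("src/auth.py", "agent-a")
assert r.reserve("src/auth.py", "agent-b")      # free after release (or TTL expiry)
```

The TTL is what prevents the "overly long file reservations" failure the guide lists later: a crashed agent's claim simply expires instead of blocking the swarm forever.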

The most underrated idea may be fungibility. Emanuel argues against specialist-agent casts and against a boss agent that holds the whole project in its head. Every agent reads the same AGENTS.md, can pick up any bead, and can be replaced when a session crashes or degrades after compaction. Coordination should live in artifacts, not personalities.

A related Flywheel term is “landable.” A session only counts as done when a future swarm can restart from the repo, the beads, AGENTS.md, and the message threads without a human re-explaining the project. Even teams that never adopt Flywheel should steal that standard.

How To Classify It

The cleanest label for Flywheel is repo-native operating protocol. It is built for coding agents working inside a shared codebase. It is not a general business-workflow agent framework, and it is not just a nicer prompt template for one model session. Its center of gravity is artifact design, task routing, and coordination under failure.

Its boldest bet is the git model. The guide argues against worktrees and branch-per-agent workflows, recommends one shared workspace, and has agents pull, reserve files, edit, test, commit, push, and release reservations. Review is woven into the swarm itself. The guide is explicit that there is no pull request, no human reviewer, and no approval gate in that core implementation loop.
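That loop can be sketched as a single function. Everything here is an illustration under assumptions: the command names (`pytest` as the test runner), the ordering, and the injected `reserve`/`release`/`run` hooks are mine, not the guide's exact tooling; a real runner would shell out to git via `subprocess`.

```python
def land_bead(files, message, reserve, release, run):
    """One pass of the shared-workspace loop: pull, reserve, edit, test,
    commit, push, release. Hooks are injected so the sketch stays testable."""
    run("git", "pull", "--rebase")
    if not all(reserve(f) for f in files):
        for f in files:
            release(f)                     # drop any partial reservations
        return False                       # back off; retry after a delay
    try:
        # ... edit `files` here ...
        run("pytest", "-q")                # verify before committing (assumed test command)
        run("git", "add", *files)
        run("git", "commit", "-m", message)
        run("git", "push")
        return True
    finally:
        for f in files:
            release(f)

# Dry run with a recording runner instead of real git:
log = []
ok = land_bead(["src/auth.py"], "sso-07: callback logic",
               reserve=lambda f: True, release=lambda f: None,
               run=lambda *cmd: log.append(cmd))
print(ok, [c[0] for c in log])  # True ['git', 'pytest', 'git', 'git', 'git']
```

Note what is absent: no branch, no PR, no human gate. The only safety valves are the test run, the reservations, and the reviewer agents elsewhere in the swarm.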

The public positioning is broader than the real sweet spot. The homepage says you can get started in about 30 minutes, spend roughly $440 to $656 a month on VPS plus model subscriptions, and even manage without coding experience if you can follow instructions. The core guide is narrower. It says the target reader is a relatively smart software developer who wants to coordinate multiple agents without chaos. The second description is closer to the truth. Flywheel rewards architecture judgment, debugging skill, and taste.

Where It Shines

Flywheel shines on work that is both spec-heavy and divisible. Think internal tools, multi-surface product features, backend plus frontend plus tests, large refactors with a clear dependency structure, or hardening passes that need implementation, review, linting, and bug scanning to happen in parallel.

Imagine adding enterprise SSO plus audit logs to an existing SaaS app. A good Flywheel decomposition would split provider abstraction and secrets wiring, identity schema changes, callback and session logic, admin-visible audit trails, UI states, and end-to-end failure cases into separate beads. Two agents can move on backend surfaces, one can own UI and tests, and a reviewer can cross-check integration.

For a first real trial, one could start smaller than that. Anthropic recommends beginning with research, review, or bug-investigation tasks that have clear boundaries. Flywheel’s own core guide says to start with 1, 2, or 4 agents, not a huge swarm, and check progress every 10 to 15 minutes. That is a better on-ramp than trying to spawn 10 terminals on day 1. (Claude)

Where It Breaks

The method stumbles when the work cannot be specified cleanly upfront. If the real job is product discovery, not engineering execution, front-loading 80 to 85 percent of the effort into planning can create false certainty. The guide argues that planning can discover requirements, and that is true up to a point. But user-facing ambiguity often resolves only after rough prototypes hit real users. Flywheel is strongest once the main workflows are already legible.

It also breaks down on same-file density. Anthropic’s docs warn that coordination overhead grows with team size and that two teammates editing the same file leads to overwrites. Flywheel uses reservations and pre-commit guards to manage that risk, but the core limit remains: if the work lives in the same few files, parallelism buys less than it costs. (Claude)

Then there is process fragility. The guide itself lists common failures: the plan-bead gap, duplicate beads once sets exceed 100 tasks, context-window exhaustion after a few polishing passes, communication purgatory, and overly long file reservations that block others.

Governance is the other hard limit. If your team needs protected branches, formal review, or compliance checkpoints, Flywheel’s default direct-to-main model will feel alien. You can still borrow the best parts while keeping your existing branch and PR policy: plan space, beads, AGENTS.md, graph triage, and explicit coordination. In practice, that hybrid is probably the safest way for most teams to start.

A Practical Starter Recipe

The full guide itself recommends an incremental path: start with Agent Mail, beads, and bv, then layer on bug scanning, destructive-command guards, session search, and memory later. That sequence is one of the best parts of the public docs. (Agent Flywheel)

  1. Pick one real feature with clear boundaries. Anthropic says parallel work works best when tasks are self-contained and file ownership is clear, and Flywheel’s core guide says to start smaller than your ego wants to. (Claude)
  2. Write one serious markdown plan. The plan should cover workflows, architecture, sequencing, constraints, failure paths, and tests. Flywheel treats that document as the cheapest place to buy coherence. (Agent Flywheel)
  3. Turn the plan into rich beads, then polish them repeatedly. The core guide says single-pass beads are never optimal and recommends 4 to 6 review passes. (Agent Flywheel)
  4. Keep AGENTS.md lean but real. At minimum it should say what the repo is for, what tools exist, the non-negotiable rules, and how agents should use Mail, bd, and bv. After compaction, agents should reread it. (Agent Flywheel)
  5. Launch 2 to 4 agents, not 12. Flywheel’s own starter guide says 2 to 4 is enough to feel real coordination. Claude’s docs say 3 to 5 teammates is a good default and warn that coordination and token cost rise with team size. (Agent Flywheel)
  6. If the swarm looks busy but the product is not getting closer, stop coding and step back into planning or bead repair. Both Flywheel guides call out this “reality check” explicitly. (Agent Flywheel)

The best adoption rule is to steal Flywheel in layers: plan harder, translate the plan into dependency-aware tasks, make AGENTS.md real, then add graph triage and explicit coordination. The memory stack and the rest of the ecosystem can wait.

Alternatives Worth Knowing

If you want a closer-to-default path, Claude Code Plan Mode plus experimental agent teams is the cleanest current alternative. Plan Mode gives read-only analysis before changes. Agent teams give you a lead, a shared task list, direct teammate messaging, plan approval when needed, and a recommended starting size of 3 to 5 teammates. That is less opinionated than Flywheel and easier to fit into existing workflows. It is also explicitly experimental and still carries the usual token and coordination costs. (Claude)

If you mostly want one strong partner, Aider is the lighter choice. It positions itself as AI pair programming in your terminal, offers code, architect, ask, and help modes, can follow repo conventions, and can auto-fix linting and testing errors. (Aider)

Gas Town sits on the other end: a persistent multi-agent workspace manager with git-backed work tracking and a worktree-oriented approach, plus explicit acknowledgment that Mail + Beads helped seed the design. (Gas Town)

If you are building agent products rather than running swarms inside a repo, LangGraph and OpenHands solve a different layer of the stack. LangGraph is low-level infrastructure for long-running, stateful workflows with durable execution, human oversight, memory, and debugging. OpenHands is a model-agnostic platform for coding agents that can run locally or at scale, integrate with GitHub, GitLab, and Slack, and open reviewable PRs across broader engineering workflows. Flywheel lives closer to the operator’s terminal and the repo’s internal memory. (LangChain Docs)

The Bottom Line

The best way to think about Flywheel is as a discipline for making a repo legible to machines. The biggest leap is not swarm size or model mix. It is the decision to move architecture, task boundaries, coordination, and recovery out of chat and into artifacts that survive crashes, compaction, and handoffs. That idea will outlast the current stack, and it is worth stealing even if you never adopt Flywheel whole. (Agent Flywheel)

Use the full method when the work is large, divisible, and already sharp enough to spec. Use a lighter loop when the task is small, exploratory, or boxed in by branch and review policy.