AI News & Strategy Daily | Nate B Jones · 29m

“Agents” Means 4 Different Things, and Almost Nobody Knows Which One They Need

TL;DR

  • “Agent” is an overloaded term covering at least four distinct systems — Nate B Jones argues most teams blur together coding harnesses, dark factories, auto-research loops, and orchestration frameworks, then wonder why their AI projects fail.

  • Coding harnesses are best for task-scale engineering, and decomposition is the superpower — tools like Claude Code and Codex work when a human or planner can break gnarly work into clean chunks, like Peter Steinberger managing multiple Codex agents in 20-minute cycles or Andrej Karpathy running agents for 16 hours a day.

  • Project-scale AI coding needs a different architecture than “every engineer gets 5 copilots” — Nate points to Cursor’s multi-agent setup, where a planner agent manages short-running executor agents and tracks memory/tasks, as the unlock for browsers, compilers, and millions of lines of code.

  • Dark factories are about removing humans from the middle, not just making agents run longer — you give the system a spec, enforce strong evals, and let it iterate until it passes, with humans concentrated at the beginning and end; he uses the literal “lights-off factory” metaphor and cites Amazon’s recent concern over AI-generated production incidents.

  • Auto research is not software generation — it’s metric optimization — if you have a measurable target like runtime performance, model tuning quality, or conversion rate, LLMs can “hill climb” through experiments, as in Toby Lütke’s optimization work on Shopify’s Liquid framework and Karpathy’s recent auto-research package.

  • Orchestration is really workflow handoff, and it only pays off at scale — frameworks like LangGraph or CrewAI shine when specialized agents must pass work between roles across thousands or millions of tickets, but Nate warns the human overhead of prompts, context, and routing often makes orchestration feel heavy.

The Breakdown

Why “agents” is too vague to be useful

Nate opens with the core complaint: people say “agent” as if it just means “LLM plus tools plus a loop,” but that collapses together very different production systems. His whole point is practical — if you don’t know which kind you’re actually building, you’ll pick the wrong setup and “get into big trouble.”

Coding harnesses: the solo developer’s AI teammate

He starts with the simplest species: coding harnesses, the Claude Code/Codex style setup where an agent basically stands in for a developer with access to files, search, and write tools. The human becomes a manager, not the typist, and that’s why Karpathy talking about agents coding 16 hours a day feels normal by 2026.

The real trick is decomposition, not raw model cleverness

What makes coding harnesses work, he says, is decomposition — taking a gnarly project and ripping it into well-defined chunks. He uses Peter Steinberger’s OpenClaw workflow as the concrete image: multiple Codex agents running in parallel, each tackling a bounded task, then checking back roughly every 20 minutes.
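The decomposition pattern can be sketched as a small management loop: break the project into bounded tasks, dispatch each to its own agent in parallel, and collect results on a fixed cadence. This is a minimal illustration, not Steinberger's actual tooling — `run_agent` and the task names are stand-ins.

```python
import concurrent.futures

def run_agent(task: str) -> str:
    # Hypothetical stand-in for handing a bounded task to a coding
    # harness (Claude Code / Codex style); a real call would block
    # until the agent finishes or checks in.
    return f"done: {task}"

def decompose(project: str) -> list[str]:
    # In practice a human or planner model does this split; the
    # chunks here are hard-coded for illustration.
    return [f"{project}/parser", f"{project}/tests", f"{project}/docs"]

def manage(project: str) -> list[str]:
    tasks = decompose(project)
    # Run each bounded task as its own agent, in parallel; the human
    # "manager" reviews results on a cadence (~20 minutes in the
    # Steinberger example) rather than typing code directly.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        return list(pool.map(run_agent, tasks))

results = manage("gnarly-refactor")
```

The point of the sketch is that the intelligence lives in `decompose`: clean task boundaries are what let the parallel agents run without stepping on each other.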

Cursor’s big unlock: agents managing agents at project scale

Once work gets bigger than an individual contributor can comfortably hold, Nate says the architecture has to flip from human-centered to agent-centered. He points to Cursor’s public writeups on browsers, compilers, and millions of lines of code: a planner/manager agent tracks tasks and memory, while short-running executor agents get spun up to solve one specific problem at a time.
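The planner/executor split described here can be sketched as one long-lived planner that owns tasks and memory, spawning short-lived executors that each solve a single problem and exit. This is a hedged toy model of the pattern, not Cursor's implementation; `execute` stands in for a real agent call.

```python
from dataclasses import dataclass, field

def execute(task: str, context: dict[str, str]) -> str:
    # Stand-in for a short-running executor agent with a bounded
    # context window: it sees one task plus planner-selected context.
    return f"solved {task} (knew about {len(context)} prior tasks)"

@dataclass
class Planner:
    tasks: list[str]
    memory: dict[str, str] = field(default_factory=dict)

    def run(self) -> dict[str, str]:
        while self.tasks:
            task = self.tasks.pop(0)
            # A fresh executor per task; the planner decides what
            # memory to pass down, then records the result. All
            # long-lived state lives here, not in the executors.
            result = execute(task, context=dict(self.memory))
            self.memory[task] = result
        return self.memory

state = Planner(tasks=["lexer", "parser", "codegen"]).run()
```

The design choice the sketch highlights: executors stay disposable and cheap to restart, while the planner is the only component that has to reason about the whole project.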

Simple beats fancy in multi-agent coding systems

There’s a nice reality check here: Cursor apparently tried adding a third level of management hierarchy and found it performed worse than the simpler two-level planner/executor setup. Nate keeps repeating the lesson as almost a design mantra — with agents, simple scales — and says teams that only speed up individual engineers still keep their old project bottlenecks, except now with even more code review and coordination pain.

Dark factories: humans at the edges, evals in the middle

Then he shifts into dark factories, which are less about helping a coder and more about a full autonomous pipeline from spec to passing eval. His metaphor is the literal lights-off factory in China: humans set intent at the start, maybe inspect at the end, but the middle is built to keep humans from becoming the bottleneck while agents push work through fast.

Why dark factories still make enterprises nervous

Nate is careful not to romanticize this. He says some bold teams launch straight to production without a human reading the code, but most serious enterprises are uncomfortable with that, and he brings up Amazon pulling senior and principal engineers into Seattle after AI-generated production incidents tied to junior engineers.

Auto research: not software, but hill-climbing toward a metric

Auto research, he says, is a totally different beast descended from classic machine learning. If there’s a measurable target — runtime performance, model tuning quality, conversion rate — an LLM can run repeated experiments and “climb the hill,” like Toby Lütke optimizing Shopify’s Liquid framework or Karpathy using auto-research to push toward GPT-2-scale results.
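Stripped to its skeleton, this kind of auto research is a propose-measure-keep loop: try a variant, score it against the metric, and keep it only if it improves. The sketch below uses a toy objective and random proposals; in the systems Nate describes, an LLM would propose the next experiment and `measure` would be a real benchmark (runtime, tuning quality, conversion rate).

```python
import random

def measure(params: dict[str, float]) -> float:
    # Toy metric: peaks at x == 3.0. Stand-in for a real benchmark.
    return -abs(params["x"] - 3.0)

def propose(params: dict[str, float], rng: random.Random) -> dict[str, float]:
    # Random perturbation here; an auto-research loop would have a
    # model reason about past experiments before proposing the next.
    return {"x": params["x"] + rng.uniform(-1.0, 1.0)}

def hill_climb(start: dict[str, float], steps: int = 200, seed: int = 0):
    rng = random.Random(seed)
    best, best_score = start, measure(start)
    for _ in range(steps):
        cand = propose(best, rng)
        score = measure(cand)
        if score > best_score:  # keep only strict improvements
            best, best_score = cand, score
    return best, best_score

best, score = hill_climb({"x": 0.0})
```

The prerequisite the sketch makes explicit: none of this works without a `measure` you trust, which is why he frames auto research as metric optimization rather than software generation.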

Orchestration is specialized handoffs, not one unified coding goal

He saves orchestration for last because it’s the messiest: LangGraph, CrewAI, and similar systems coordinate specialized agents across steps like research, drafting, ticket handling, and closure. The key distinction from Cursor-style coding is that these are role-based handoffs rather than sub-agents serving one codebase goal, and Nate’s test is blunt: if you’re not doing enough volume — say 10,000 tickets instead of 100 — all that routing complexity may not be worth it.
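The role-based handoff pattern can be sketched as a pipeline where each specialist agent transforms a work item and passes it to the next role. This is an illustrative skeleton, not LangGraph or CrewAI API code; all three agent bodies are stubs.

```python
def research(ticket: dict) -> dict:
    # Specialist role 1: gather context for the ticket.
    ticket["notes"] = f"background on {ticket['subject']}"
    return ticket

def draft(ticket: dict) -> dict:
    # Specialist role 2: write a reply from the researcher's notes.
    ticket["reply"] = f"Draft reply using: {ticket['notes']}"
    return ticket

def close(ticket: dict) -> dict:
    # Specialist role 3: finalize and close out the ticket.
    ticket["status"] = "closed"
    return ticket

PIPELINE = [research, draft, close]  # the routing table, hard-coded here

def handle(ticket: dict) -> dict:
    # At ~100 tickets, maintaining prompts, context, and routing for
    # each role is pure overhead; at 10,000+ the handoffs pay off.
    for agent in PIPELINE:
        ticket = agent(ticket)
    return ticket

done = handle({"subject": "billing error", "status": "open"})
```

Contrast with the Cursor-style setup above: these roles serve different steps of a workflow, not sub-tasks of one codebase goal, which is exactly the distinction Nate draws.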