Alea
Matthew Berman · 17m

Hard Takeoff has started

TL;DR

  • Recursive self-improvement is already here — Matthew Berman argues the shift has happened because models are now helping design experiments, write code, run evaluations, and improve the systems that train their successors.

  • MiniMax says M2.7 participated in its own evolution — the Chinese open-source lab claims the model handled 30–50% of the workflow, updated its own memory, built RL harness skills, and delivered a 30% performance gain on internal evals.

  • OpenAI has said the quiet part out loud — Berman highlights GPT-5.3 Codex as “our first model that was instrumental in creating itself,” with early checkpoints debugging training, managing deployment, and diagnosing evals for later checkpoints.

  • Anthropic is doing the same thing through coding-first agent loops — even without using the phrase “self-improvement,” Claude Code is reportedly powering “almost all of our major agent loops,” including autonomous loops that write code, run tests, and iterate continuously.

  • This is no longer just a frontier-lab game — Berman points to Andrej Karpathy’s open-source auto-research project, where frontier models autonomously iterate on GPT-2 training, and says Karpathy achieved a state-of-the-art training speed after a single night.

  • Even non-ML practitioners can now run autonomous research systems — Berman describes his own OpenClaw setup, using a frontier model to run overnight fine-tuning experiments on models like Qwen 27B, compare them against Opus 4.6, and recursively generate new experiments without deep ML expertise.

The Breakdown

“We’ve crossed into it”: the opening claim

Berman starts with a big one: we are officially in the recursive self-improvement phase of AI. His framing is that the old bottleneck — human researchers manually driving every major advance — is beginning to break, and the new limit is mostly compute. In his telling, this is the opening move of an “intelligence explosion,” and people still aren’t talking about it like it’s already underway.

MiniMax M2.7 as the clearest proof point

His first exhibit is MiniMax M2.7, from the Chinese frontier lab MiniMax, which he says plans to open-source the weights soon. What grabs him is not just model quality but the company's description of how it was built: M2.7 "deeply participating in its own evolution," updating its memory, building RL harness skills, and improving its own learning process based on experiment results. The workflow is still human-plus-AI, but MiniMax says AI already handles 30–50% of it, and Berman sees that as the key threshold crossing.

The loop gets tighter: humans steer, agents do the work

He walks through MiniMax’s iteration system: humans set direction and review results, while the agent writes code, launches experiments, analyzes outputs, and feeds results back into the next cycle. The striking part for him is that M2.7 discovered optimizations on its own — things like sampling parameter tuning, workflow guideline improvements, and loop detection — producing a reported 30% boost on evals. His takeaway, delivered with something like disbelief, is simple: the model is now good enough to improve itself.
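The cycle he describes can be sketched as a short loop. This is an illustrative stand-in, not MiniMax's actual system: `propose_experiment` and `run_experiment` are hypothetical placeholders for the agent's real work, and the scores are mocked.

```python
import random

# Hedged sketch of the human-plus-AI iteration loop described above:
# the human sets a direction and reviews the winner; the agent proposes,
# runs, and analyzes each experiment. All functions are placeholders.

def propose_experiment(direction, history):
    """Agent drafts the next experiment from the goal and past results."""
    return {"id": len(history), "direction": direction,
            "tweak": random.choice(["sampling_params",
                                    "workflow_guidelines",
                                    "loop_detection"])}

def run_experiment(experiment):
    """Stand-in for launching a training/eval run; returns a mock score."""
    return {"experiment": experiment, "score": random.uniform(0.5, 1.0)}

def iterate(direction, cycles=5):
    history, best = [], None
    for _ in range(cycles):
        exp = propose_experiment(direction, history)  # agent writes/configures
        result = run_experiment(exp)                  # agent launches the run
        history.append(result)                        # results feed the next cycle
        if best is None or result["score"] > best["score"]:
            best = result                             # human reviews the winner
    return best, history

best, history = iterate("improve eval accuracy")
```

The point of the structure, in Berman's telling, is that the human appears only at the top (direction) and bottom (review); everything inside the loop is agent work.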

OpenAI and Anthropic confirm the pattern in different ways

Berman then stacks other labs on top of MiniMax. OpenAI is explicit: GPT-5.3 Codex was “instrumental in creating itself,” with early versions debugging training, deployment, and evaluations for later versions; he adds that if that was true for 5.3, GPT-5.4 is likely even deeper in the loop. Anthropic is more indirect, but he reads the same story in their posts about Claude Code powering “almost all of our major agent loops” and in examples where Claude autonomously writes features, runs tests, and iterates before humans refine the result.

Why coding matters more than it seems

He lingers on Anthropic’s coding focus because, in his view, coding agents are the machinery of recursive improvement. Yes, code is where the revenue is, but more importantly, coding agents build better developer tooling, infrastructure management, and training/deployment systems — exactly the layers that help labs move faster. That’s why he points to Anthropic’s recent shipping velocity as evidence that self-improvement is already translating into organizational speed.

The “situational awareness” graph and the Google example

Berman ties all this back to Leopold Aschenbrenner’s situational-awareness thesis: once we hit something like an automated AI researcher, progress goes vertical. He says we are standing at the bottom of that curve right now, with every major frontier lab showing signs of recursive improvement. Google’s AlphaEvolve is his supporting case here: a coding/science system that improved Google’s internal infrastructure, reportedly saved billions, and even found a faster matrix multiplication algorithm — the first improvement of its kind in roughly 50 years.

Karpathy’s auto-research makes it accessible to everyone

The energy spikes again when he gets to Andrej Karpathy’s open-source auto-research project. Berman describes it as a setup where a frontier model like Opus or GPT-5.4 proposes experiments, edits a training script on a Git branch, runs tests, reviews the results, and repeats indefinitely with almost no human involvement. His favorite detail: Karpathy reportedly got to the fastest known time to train a GPT-2-class model after just one night.
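The propose-edit-test-repeat loop he describes can be reduced to a few lines. This is a toy sketch under stated assumptions, not Karpathy's actual code: the "model" here is a random mutation of one constant rather than an LLM call, the "training script" is a one-liner, and a real setup would commit each candidate to a Git branch instead of overwriting a temp file.

```python
import pathlib
import random
import subprocess
import sys
import tempfile

# Toy "training" script: score is best when lr is closest to 0.05.
TEMPLATE = "lr = LR_PLACEHOLDER\nprint(1.0 - abs(lr - 0.05))\n"

def propose_edit(script_text):
    """Stand-in for the model proposing a change (here: pick a constant)."""
    return script_text.replace("LR_PLACEHOLDER",
                               random.choice(["0.01", "0.05", "0.1"]))

def auto_research(iterations=5):
    workdir = pathlib.Path(tempfile.mkdtemp())
    script = workdir / "train.py"
    best_score, best_text = -1.0, None
    for _ in range(iterations):
        candidate = propose_edit(TEMPLATE)            # model edits the script
        script.write_text(candidate)
        out = subprocess.run([sys.executable, str(script)],
                             capture_output=True, text=True)  # run the experiment
        score = float(out.stdout.strip())             # model reviews the result
        if score > best_score:                        # keep improvements only
            best_score, best_text = score, candidate
    return best_score, best_text

score, winner = auto_research()
```

Swap the random mutation for a frontier-model API call and the toy script for a real training run, and the skeleton is the same: edit, run, measure, keep or discard, repeat indefinitely.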

Berman’s own overnight research loop

He closes by making it personal: he says he has no ML background, but he’s still using OpenClaw to run autonomous fine-tuning experiments on models like Qwen 27B against a baseline such as Opus 4.6. The system runs overnight, generates synthetic data, tweaks fine-tunes, swaps models when they outperform, and keeps iterating if they don’t. That’s his core thesis in one lived example: you no longer need deep ML expertise to participate — you just need to know how to point the agents.
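The decision logic of that overnight loop is worth making explicit, since it differs from the earlier loops: the candidate is promoted only when it beats a fixed frontier baseline. This is a hedged sketch with placeholder names and hard-coded scores; it is not OpenClaw's actual API, and a real loop would run a benchmark harness and fine-tuning jobs where the stubs sit.

```python
# Hedged sketch of an overnight fine-tune-and-compare loop:
# evaluate each fine-tuned candidate against the current best model,
# swap it in if it wins, otherwise keep iterating. All names and
# scores are illustrative placeholders.

def eval_score(model_name):
    """Placeholder eval; a real loop would run a benchmark suite."""
    scores = {"baseline": 0.80, "finetune_v1": 0.75, "finetune_v2": 0.83}
    return scores.get(model_name, 0.0)

def overnight_loop(candidates, baseline="baseline"):
    current = baseline
    for candidate in candidates:            # each candidate = one fine-tune run
        if eval_score(candidate) > eval_score(current):
            current = candidate             # swap in the better model
        # else: generate new synthetic data and fine-tune again
    return current

winner = overnight_loop(["finetune_v1", "finetune_v2"])
```

Here `finetune_v1` loses to the baseline and is discarded, while `finetune_v2` beats it and becomes the new reference — the "swap models when they outperform, keep iterating if they don't" behavior Berman describes.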