Matthew Berman · 11m

The ONLY benchmark that AI can't solve (humans ace it)

TL;DR

  • ARC-AGI is still the benchmark AI hasn’t crushed — Matthew Berman calls it the only major benchmark that hasn’t been saturated, with humans solving it at 100% while frontier models score under 1% on ARC-AGI 3.

  • The benchmark is really about generalization, not memorization — in ARC-AGI 1 and 2, you infer rules from a few examples and apply them to a new case, which feels trivial to humans but still trips up top models like GPT, Gemini, and Claude.

  • ARC-AGI 2 gets expensive fast without getting close to human performance — Berman highlights GPT 5.4 Pro Extra High scoring 72% at roughly $39 per task, with Gemini 3.1 Pro at 69% and Claude Opus 4.6 Medium at 68%, still far from the human 100%.

  • ARC-AGI cares about efficiency, not just raw capability — unlike benchmarks where you can throw tokens at the problem, the leaderboard tracks cost per task, making it a test of economical reasoning as much as accuracy.

  • ARC-AGI 3 turns the benchmark into a zero-instruction video game — you’re dropped into an unfamiliar interactive environment with limited moves and no tutorial, and Berman solves one by noticing that a plus-shaped switch changes the goal’s orientation, then moving to the exit.

  • Frontier models basically faceplant on the interactive version — Berman shows GPT 5.4, Gemini 3.1 Pro Preview, Grok 4.2, and Claude Opus 4.6 failing on the showcased task, with the top model reaching just 0.3% at a cost of over $5,000.

The Breakdown

Why ARC-AGI still matters

Berman opens with a big claim: ARC-AGI is the only benchmark that AI hasn’t fully saturated, and ARC-AGI 3 makes that gap even starker. His headline stat is simple and brutal — humans get 100%, AI gets less than 1% — which is why he calls it the coolest benchmark out there.

The easy-looking puzzle that machines still fumble

He walks through ARC-AGI 1 with a toy example: see a few pink three-square shapes, infer that adding a yellow square completes each 2x2 block, then apply that rule to a new grid. For a person, the answer pops out almost instantly; that’s the point. The benchmark is supposed to feel easy to humans while exposing how shaky AI still is at generalizing from tiny amounts of information.
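Once you’ve spotted the rule, it only takes a few lines to state it precisely. Here’s a rough Python sketch of that inferred transformation, assuming an integer color encoding (0 = empty, 1 = pink, 2 = yellow) purely for illustration — real ARC tasks use their own grid format:

```python
# Toy sketch of the rule Berman infers in the ARC-AGI 1 example: wherever three
# cells of a 2x2 block are pink, fill the missing fourth cell with yellow.
# The encoding (0 = empty, 1 = pink, 2 = yellow) is an assumption for
# illustration, not the actual ARC task format.

def complete_blocks(grid):
    """Return a copy of the grid with each 3-of-4 pink 2x2 block completed in yellow."""
    out = [row[:] for row in grid]
    rows, cols = len(grid), len(grid[0])
    for r in range(rows - 1):
        for c in range(cols - 1):
            block = [(r, c), (r, c + 1), (r + 1, c), (r + 1, c + 1)]
            pink = [p for p in block if grid[p[0]][p[1]] == 1]
            empty = [p for p in block if grid[p[0]][p[1]] == 0]
            if len(pink) == 3 and len(empty) == 1:
                er, ec = empty[0]
                out[er][ec] = 2  # yellow square completes the block
    return out

example = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
print(complete_blocks(example))
```

The hard part of the benchmark isn’t writing this function — it’s inferring, from a couple of examples and nothing else, that this is the rule at all.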

ARC-AGI 2 raises the difficulty without changing the core idea

The second version keeps the same “infer the rule from examples” setup but makes the latent logic much murkier. Berman works through a color-coded shape puzzle where yellow, green, blue, and red map to different internal gap patterns, and you can feel him reverse-engineering the rule live. It’s still solvable by ordinary people, but no longer obvious at a glance.

The leaderboard tells a very different story from typical AI benchmarks

On ARC-AGI 1, the best models are already around 93-94%, so it’s close to being maxed out. ARC-AGI 2 is where the separation shows: GPT 5.4 Pro Extra High hits 72% at $39 per task, Gemini 3.1 Pro gets 69%, and Claude Opus 4.6 Medium lands at 68%. Berman’s contrast is sharp: on coding or math benchmarks, AI beats elite humans; here, average humans still beat the models.

What makes this benchmark unusually important

Berman pauses before ARC-AGI 3 to explain why ARC-AGI feels special. First, it tracks cost per task, so brute-forcing with huge token spend is part of the problem, not a workaround. Second, it measures something closer to everyday flexible reasoning — the kind regular humans use constantly — rather than specialist expertise.
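To make the efficiency point concrete, here’s a back-of-the-envelope sketch using the $39-per-task figure Berman quotes; the task count is a placeholder assumption, not the actual size of the ARC-AGI 2 evaluation set.

```python
# Why cost per task matters: the same accuracy looks very different once you
# price out a full evaluation run. The 72% and $39/task figures are the ones
# Berman quotes for GPT 5.4 Pro Extra High; num_tasks is a placeholder.

def eval_cost(cost_per_task: float, num_tasks: int) -> float:
    """Total spend for one pass over the evaluation set."""
    return cost_per_task * num_tasks

accuracy = 0.72          # reported score
cost_per_task = 39.0     # reported dollars per task
num_tasks = 100          # assumed set size, for illustration only

print(f"Accuracy: {accuracy:.0%}")
print(f"Estimated full-run cost: ${eval_cost(cost_per_task, num_tasks):,.0f}")
print(f"Accuracy per dollar per task: {accuracy / cost_per_task:.4f}")
```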

ARC-AGI 3 becomes a no-instructions game

Then the benchmark changes form completely: instead of static examples, you’re dropped into a little game world with arrows, a reset button, a yellow bar, a maze, and no explanation. Berman narrates his own thinking in real time, guessing what the UI means, testing one move, noticing the bar drop, and realizing the plus-shaped object probably changes the goal state. That “let me poke at the environment and form a theory” loop is exactly what the benchmark is testing.
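That explore-and-hypothesize loop is simple to sketch in code, even though models struggle to run it well. The Env interface below is a hypothetical stand-in for the game, not ARC’s actual API; the point is the structure — probe an action, record what changed, and build evidence about what each control does.

```python
# Minimal sketch of the "poke at the environment and form a theory" loop that
# ARC-AGI 3 rewards. Env here is a hypothetical stand-in, not ARC's real API.

from typing import Any, Protocol


class Env(Protocol):
    def reset(self) -> Any: ...
    def step(self, action: str) -> Any: ...


def explore(env: Env, actions: list[str], budget: int) -> dict[str, list[tuple[Any, Any]]]:
    """Probe each available action and record (before, after) observation pairs."""
    notes: dict[str, list[tuple[Any, Any]]] = {a: [] for a in actions}
    obs = env.reset()
    for step in range(budget):
        action = actions[step % len(actions)]  # naive round-robin probing
        new_obs = env.step(action)
        notes[action].append((obs, new_obs))   # evidence for "what does this do?"
        obs = new_obs
    return notes
```

From those before/after pairs a human quickly forms a theory like “the plus reorients the goal,” then plans a path to the exit; that theory-forming step is exactly where the models stall.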

The human solve versus the model failure

He solves the game by hitting the plus first, which reorients the target, then moving to the exit — something he says would have taken about a minute without the commentary. Watching GPT 5.4 try the same task is almost painful: it takes the first step correctly, keeps returning to the same wrong area, and never thinks to touch the plus. Berman sounds genuinely stunned because to him it feels “so obvious,” a reminder that human intuition is carrying a lot more than we realize.

Less than 1%, more than $5,000, and a $2 million prize

The results are wild: GPT 5.4, Gemini 3.1 Pro Preview, Grok 4.2, and Claude Opus 4.6 all effectively fail the showcased interactive task, while humans stay at 100%. He says the top model overall scores just 0.3%, and does it at a cost north of $5,000. ARC has released a paper, opened the benchmark for people to try, and attached a $2 million prize to anyone who can saturate it.