AskwhoCasts AI · 48m

Anthropic Responsible Scaling Policy v3: A Matter of Trust

TL;DR

  • Anthropic’s RSP v3 drops the old ‘pause if dangerous’ posture in favor of softer judgment calls — Zvi Mowshowitz says the core shift is from concrete if-then commitments to “we will make reasonable to us arguments” about whether benefits outweigh risks, especially under competitive pressure.

  • The real controversy isn’t just policy design — it’s trust — the episode argues Anthropic benefited for years from outsiders, employees, and recruits treating the RSP as a real binding commitment, with receipts including Evan Hubinger’s past claim that Anthropic’s pause conditions were “very clear” and in “big bold font on the second page.”

  • Zvi gives Anthropic partial credit for admitting the change before violating the old policy in practice — he repeatedly says it is higher-integrity to announce the break now than to silently rewrite the rules right before a release, even while insisting the company should still bear consequences for broken promises.

  • Competition logic has effectively won inside frontier AI — the headline change is that Anthropic no longer commits to hold back if rivals race ahead, and Zvi reads Holden Karnofsky and Jared Kaplan as saying unilateral slowdowns no longer make sense in the strategic environment of 2026.

  • This weakens the whole voluntary self-governance story, not just Anthropic’s brand — Zvi argues RSPs were supposed to show that labs could bind themselves and maybe seed regulation like SB 53 and RAISE, but v3 instead teaches governments and rival labs that “there are no commitments, only statements of intent.”

  • The broader pattern matters: this feels like a repeat of Anthropic’s earlier ‘we won’t push the frontier’ impression — Zvi cites Ruben Bloom, Oliver Habryka, Eliezer Yudkowsky, Garrison Lovely, and others to argue that safety-flavored assurances keep eroding once they become expensive, which should update everyone’s priors about future lab promises.

The Breakdown

April Fool’s Day, but the joke is on anyone who trusted the old RSP

Zvi opens with a barb: yes, the post landed on April 1, but the real April Fools are the people who believed Anthropic’s previous responsible scaling commitments. His basic frame is not “Anthropic is evil,” but “they made promises people relied on, and walking those back damages trust, coordination, and future safety governance.”

What actually changed in RSP v3

The headline change, as he puts it, is that Anthropic is no longer meaningfully committed to not releasing potentially dangerous models if someone else goes first — “because, you know, they started it.” Anthropic’s own retrospective says the RSP helped produce stronger safeguards, ASL-3 implementation, and copycat frameworks at OpenAI and DeepMind, but failed to create consensus on risk levels or timely government action.

From hard tripwires to vibes and ‘reasonable arguments’

Zvi says the old regime at least aspired to crisp thresholds: if capabilities outran safeguards, pause. V3 abandons that for a softer model where Anthropic will weigh risks and benefits and proceed when it thinks its case is reasonable, which he calls honest but still a major downgrade from actual precommitment.

Better to confess than to quietly cheat

One of the more striking parts of the episode is that Zvi gives Anthropic real credit for saying the quiet part out loud. If the company was already effectively making ad hoc judgment calls around releases like Opus 4.6, he says it is better to admit that than to keep pretending there were real red lines.

The receipts: employees sold this as binding

A huge chunk of the episode is devoted to showing that this wasn’t just outsider confusion. Zvi quotes Evan Hubinger sharply defending the old RSP as a clear commitment to pause, and leans on Oliver Habryka’s claim that Anthropic employees told him many times the policy “binds them to a mast,” including willingness to pause if they couldn’t achieve robustness against state-backed hackers.

Why this lands like a betrayal, even if revision was inevitable

Zvi’s point is subtle: even if the old commitments were unworkable and should be revised, breaking them still costs something. He says Anthropic and its staff gained social license, safety-minded recruits, less criticism, and possibly policy influence by letting people believe the commitments were real — so “we can revise living documents” doesn’t erase the moral or strategic bill.

This also kills the broader theory of voluntary self-regulation

The episode keeps zooming out from Anthropic to the ecosystem. If eval-triggered commitments collapse the moment they become expensive, then the entire hope that labs like Anthropic, together with evaluators like METR and Apollo, could create credible if-then slowdown regimes starts to look flimsy; once dangerous thresholds arrive, the fallback is just “vibes.”

The race logic is now explicit

By the end, Zvi reads Holden Karnofsky, Sam Bowman, Jared Kaplan, and Drake Thomas as all, in different tones, conceding the same reality: unilateral pause is basically off the table unless things get cartoonishly catastrophic. His closing energy is half grief, half grim updating — better to know the experiment in self-binding failed than to keep comforting yourself with promises that turn into aspirations the moment the bill comes due.