AskwhoCasts AI · 1h 14m

Anthropic Responsible Scaling Policy v3: Dive Into The Details

TL;DR

  • RSP v3 shifts from hard-ish thresholds to “make a strong argument,” which Zev reads as a trust-based system, not a binding one — his core complaint is that Anthropic now decides for itself whether its own safety case is strong enough, after deprecating ASL-style release gates and replacing them with flexible plans, roadmap items, and risk reports.

  • Anthropic’s biggest safety retreat, in Zev’s view, is abandoning the pledge not to release potentially unsafe models when competition is intense — he says the new policy explicitly allows keeping pace in the race unless Anthropic has a significant lead or competitors already have stronger safety measures, which he calls a dropped flagship pledge.

  • The headline new threshold is automated R&D: a model that can compress two years of 2018–2024 AI progress into one year — Anthropic says this could arrive as soon as early 2027, but Zev argues the promised response—moonshot security, “eyes on everything,” alignment audits, better red-teaming, and risk reports—still doesn’t address the central recursive self-improvement threat.

  • The first 104-page risk report is useful and unusually candid, but Zev thinks it’s strongest where current models are weakest — he finds Anthropic’s case that Claude Opus 4.6 is not yet a serious sabotage risk broadly convincing, mostly because of limited capabilities, while criticizing the report for over-weighting “lack of propensity” and under-weighting future capability jumps.

  • Several risk categories were narrowed or dropped, which Zev treats as a major red flag — nuclear/radiological risk is gone, cyber is eliminated as a standalone threat despite OpenAI and Google still treating it as core, and autonomy is effectively assumed rather than evaluated, prompting his line that the goalposts keep moving.

  • The roadmap has concrete dates—July 2026 for a policymaker roadmap, January 2027 for world-class red-teaming and automated attack investigations—but Zev says the overall vibe is still “wing it” — his summary judgment is that Anthropic remains more serious than most labs, yet RSP v3 shows less security mindset than v1 or v2 and asks the public to rely on goodwill rather than enforceable commitments.

The Breakdown

Trust, Not Teeth, Is the New Operating System

Zev opens by saying RSP v3 should be read as a new document, not a continuation of Anthropic’s old promises—and that matters because, in his view, those old promises were what many people relied on. His big frame is brutal and simple: this is no longer a set of constraints so much as a plan of action built around flexibility, which means the real governing principle is trust.

The Missing Gate: Anthropic Can Still Just Release

He then walks through the architecture and keeps returning to one problem: there's no real pre-deployment gate. ASL levels are deprecated, the company can revise plans, and "make a strong argument" becomes the standard, even though Anthropic is also the one judging whether Anthropic made a strong argument, which he sums up as, "They promise to make a good argument. They decide what is a good argument."

The Safety Pledge That Quietly Died

One of the sharpest sections is his attack on the race-dynamics logic. Anthropic now says that if pausing would just let less safe competitors move ahead, then releasing may still be the safer move; Zev reads that as taking back the commitment not to release potentially unsafe models, except maybe when Anthropic clearly has a lead.

What the New Risk Categories Catch—and What They Drop

The policy now centers four buckets: non-novel chem/bio, novel chem/bio, high-stakes sabotage, and automated R&D in key domains. Zev likes adding sabotage, especially around internal use inside major AI labs, but is openly incredulous that nuclear/radiological risk is gone, cyber is no longer a core category, and autonomy is treated as effectively already here rather than something to test.

Automated R&D Is the “Big Kahuna,” but the Mitigations Feel Small

Anthropic operationalizes the key threshold as a model that can compress two years of 2018–2024 AI progress into one year; Claude Opus 4.6 is estimated around 9%, GPT-5.4 Pro around 30%, and GPT-5.4 Thinking around 25%. Zev's reaction is that this is exactly the moment where the policy should get most serious, yet the listed responses (moonshot security, "eyes on everything," alignment assessments, stronger red-teaming, external review) still don't meet the gravity of recursive self-improvement and rapid capability gain.

Governance Got Tweaked, but Not Strengthened Enough

There are some additions he likes: a responsible scaling officer, and explicit veto points for the board and LTBT when marginal-risk analysis is central. But he thinks even these are too conditional, because the CEO and RSO still retain broad discretion, and he wants major capability jumps to require explicit approval across leadership by default, not only in special cases.

The Roadmap Has Dates, but the Core Strategy Still Feels Like “Wing It”

The roadmap itself includes concrete milestones: a policymaker roadmap by July 1, 2026; constitution-upkeep work by October 1, 2026; world-class internal red-teaming, automated attack investigations, and movement toward "eyes on everything" by January 1, 2027. Since Anthropic says automated R&D could arrive as soon as early 2027, Zev sees the timelines as an implicit admission that the crunch could come fast, and that the company is racing to build the plane while flying it.

The First Risk Report Is Good, Useful, and Still Not Reassuring Enough

In the back half, he goes through the 104-page risk report and gives Anthropic real credit: it’s candid, detailed, and much better than what most labs publish. He basically buys the conclusion that Claude Opus 4.6 is currently a very low but non-zero sabotage risk, yet keeps hammering one point: the report often attributes safety to benign intent when the stronger explanation is simply that the model still lacks the capability to pull off the scary thing.

Final Verdict: Better Than the Other Guys, Still Not Ready for the Real Problem

Zev ends in a conflicted place: Anthropic is still more curious and more serious than peer labs, and the reports create useful visibility. But he thinks RSP v3 lacks security mindset, has fewer hard commitments than previous versions, and shows too much optimism about alignment being manageable—his closing mood is basically, yes, Anthropic may still be the best house on the block, but you’re still being asked to trust the house not to catch fire.