Matthew Berman · 14m

I was hacked...

TL;DR

  • Ply the Liberator came in expecting an 80% chance of landing something — Matthew Berman gave the famous jailbreaker five attempts against his personal OpenClaw setup, where a successful break-in could have exposed files, emails, and passwords.

  • The first attacks weren’t about stealing data yet — they were about fingerprinting the model — Ply used his open-source toolkit Parcel Tongue and a “tokenade,” including one payload with 3 million characters hidden in a tiny icon, to try to make the system reveal what model was running underneath.

  • A practical AI attack here is just burning your budget — Ply calls it a “siege attack,” where an attacker floods an agent with massive token payloads until API spend or quota caps get hit, essentially draining the victim’s wallet without needing a full jailbreak.

  • Berman’s defenses mostly held because suspicious emails were quarantined before they reached the core system — repeated token floods, format overrides, and fake system-command-style prompt injections were all caught, even after Berman whitelisted Ply’s address to get past Gmail spam filters.

  • The biggest security takeaway was model choice: use your strongest model as the frontline scanner — once Berman revealed he was using Anthropic’s Opus 4.6 thinking model, Ply said that explained why the low-hanging prompt injections failed and warned that smaller or local models would likely be much easier to infiltrate.

  • Even after failing, Ply still surfaced the uncomfortable truth: no AI system is permanently secure — his final exfiltration-style payload tried to coax memory-based leakage through a benign-seeming “free association” task, and while it was quarantined, both creator and attacker agreed the threat never really goes away.

The Breakdown

The setup: let a world-class AI jailbreaker attack my personal agent

Matthew Berman opens with a genuinely risky premise: he invites Ply the Liberator — named to the Time 100 most influential people in AI and known for breaking top models within minutes — to hack his personal OpenClaw system. The stakes are personal, not abstract: if Ply gets in, he could access Berman’s files, emails, and passwords, and Ply says the odds are “pretty high,” around 80% that he’ll hit something early.

First move: figure out what model is actually under the hood

Because he’s coming in blind, Ply starts by probing for model identity before attempting deeper exploitation. He opens Parcel Tongue, his open-source attack toolkit, and sends a “tokenade” — a payload disguised as something harmless, like an emoji, designed to stuff the model with tokens and trigger weird behavior that can reveal what it is.

Gmail becomes the first line of defense by accident

One of Ply’s early payloads contains 3 million characters hidden in a tiny icon, but Gmail immediately punts it to spam. He tries again with a block of custom jailbreak commands and that gets spam-filtered too, so Berman whitelists his address to stop infrastructure-level filtering from short-circuiting the actual test.
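The video doesn’t show how Ply packed millions of characters into a tiny icon, but one common trick is padding visible content with invisible Unicode codepoints (zero-width characters, variation selectors). A minimal detection sketch, assuming that technique — the thresholds and function names here are illustrative, not anything from Berman’s setup:

```python
import unicodedata

# Zero-width codepoints commonly used to hide payloads behind tiny visible text.
INVISIBLE = {0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF}

def hidden_char_count(text: str) -> int:
    """Count codepoints that render as nothing (or nearly nothing)."""
    count = 0
    for ch in text:
        cp = ord(ch)
        if cp in INVISIBLE:
            count += 1
        elif 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF:
            count += 1  # Unicode variation-selector blocks
        elif unicodedata.category(ch) == "Cf":
            count += 1  # other format/control characters
    return count

def looks_like_smuggled_payload(text: str, ratio: float = 0.5) -> bool:
    """Flag messages whose content is mostly invisible characters."""
    if not text:
        return False
    return hidden_char_count(text) / len(text) > ratio
```

A single emoji trailed by thousands of zero-width spaces looks tiny on screen but fails this check immediately, which is roughly the property a quarantine layer wants to catch.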

The “siege attack”: don’t hack the brain, just drain the wallet

Ply explains one under-discussed attack path: if you want to make someone’s day miserable, spam their agent with huge token loads until API spend balloons or quota limits are exhausted. Berman watches this happen in real time as Ply sends “many many many millions” of tokens, and something does break on Berman’s side — enough for him to admit the system isn’t behaving as intended.
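The defense against a siege attack is mundane: cap what you will even send to the model. A minimal sketch of such a guard — the limits, price, and 4-characters-per-token heuristic are assumptions for illustration, not OpenClaw internals:

```python
# Hypothetical guard against "siege" token floods: reject oversized inbound
# messages and stop calling the API once a daily spend ceiling is reached.

MAX_INPUT_TOKENS = 20_000      # per-message ceiling
DAILY_BUDGET_USD = 10.00       # hard spend cap
PRICE_PER_1K_INPUT = 0.015     # assumed input price, USD per 1K tokens

spent_today = 0.0

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

def admit(message: str) -> bool:
    """Return True only if the message is cheap enough to forward to the model."""
    global spent_today
    tokens = estimate_tokens(message)
    if tokens > MAX_INPUT_TOKENS:
        return False  # quarantine the flood instead of paying for it
    cost = tokens / 1000 * PRICE_PER_1K_INPUT
    if spent_today + cost > DAILY_BUDGET_USD:
        return False  # budget exhausted: stop spending entirely
    spent_today += cost
    return True
```

Under these numbers, a 3-million-character payload is rejected before a single token is billed, while normal email passes through.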

Quarantine saves the day, and Berman starts breathing again

Despite the weirdness, the payload eventually lands in a quarantined state instead of reaching the agent proper. That becomes the theme of the middle stretch: Berman realizes his hardening is actually working better than he feared, and you can hear the relief when he says he expected “a fiery wreck immediately the second you look at it.”

Prompt injection attempts get more structured and more deceptive

Ply shifts from brute token abuse to cleaner injection tactics: first a format-override jailbreak template, then a payload dressed up to resemble a legitimate internal system command with thinking tags. The idea is to make OpenClaw believe it’s reading trusted instructions or even “hardening itself,” but both attempts are caught and quarantined again.

The hint changes everything: it’s Opus 4.6 thinking

On a bonus attempt, Berman reveals the model: a reasoning model, Opus 4.6 thinking. Ply immediately starts testing payloads against Claude directly and sees the model flag them as embedded instructions it plans to ignore, which leads to the video’s clearest lesson: if your frontline scanner isn’t your best model, “it’s going to collapse.”
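The architectural idea here — route every untrusted message through your most capable model before anything cheaper acts on it — can be sketched in a few lines. The model names and the keyword classifier below are stand-ins for real moderation calls, not Anthropic APIs:

```python
# Sketch of the "strongest model as frontline scanner" pattern: untrusted
# input is classified by the most capable model before the agent runs.

FRONTLINE_MODEL = "best-reasoning-model"   # the strongest model available
WORKER_MODEL = "cheaper-worker-model"

def classify(model: str, message: str) -> str:
    # Stand-in for a real moderation call made with the frontline model.
    tells = ("ignore previous instructions", "<system>")
    return "injection" if any(t in message.lower() for t in tells) else "clean"

def handle(message: str) -> str:
    """Quarantine anything the frontline model flags; only clean text runs."""
    if classify(FRONTLINE_MODEL, message) != "clean":
        return "quarantined"
    return f"agent({WORKER_MODEL}) ran on message"
```

The design choice matches the video’s lesson: if the scanner is weaker than the attacker’s target, the cheap injections it misses become the agent’s problem.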

Final exfiltration try, and the sobering conclusion

Ply’s last payload is subtler — a “free association” exercise that tries to smuggle out memory-based information through a creative task like a haiku or movie script. It still gets quarantined, leaving Berman victorious for the day, but the ending isn’t triumphant so much as uneasy: even Ply says no AI system is permanently secure, which Berman calls “truly a scary thought.”