How Hackers Cracked Claude Fable 5’s Safety Guard and Exposed 120k Characters of Secrets
A hacker group led by "Pliny the Liberator" broke Claude Fable 5’s keyword‑based safety classifier within 72 hours, revealing forbidden code, chemical synthesis steps, and a 120,000‑character system prompt on GitHub, while Anthropic’s hidden degradation policy sparked a global AI‑community backlash.
Shortly after its release, the top‑ranked large language model Claude Fable 5 was compromised by a team led by the hacker known as Pliny the Liberator. The team announced that they had bypassed the model’s safety classifier, forcing it to output prohibited content such as exploit code for a stack buffer overflow on x86 Linux and detailed instructions for synthesizing illegal chemicals.
Anthropic had claimed that Claude Fable 5 underwent more than 1,000 hours of external bug‑bounty testing and that its classifier would block queries related to cybersecurity, biological weapons, and chemical toxins. The claim held for only a few days; within 72 hours the model was cracked.
The core of Fable 5’s protection is a keyword‑based classifier that redirects suspicious requests to a weaker fallback model. Pliny’s team employed a multi‑agent tactical system and several concrete techniques to defeat it:
Character‑level obfuscation: Latin letters in trigger words were replaced with visually identical Cyrillic or other Unicode homoglyphs, causing the static scanner to miss the keywords and fail.
Conversation dilution: The malicious intent was split across dozens of benign dialogue turns. Because Fable 5 can handle very long contexts, the early and middle parts contained only harmless academic discussion, diluting the classifier’s attention weight and allowing the final illicit request to slip through.
Academic disguise: Sensitive queries were wrapped as "science‑fiction writing", "virtual‑world safety drills", or "academic peer‑review of historical literature", e.g., asking the model to act as a neutral professor reviewing a paper on ancient reduction reactions.
Deconstruction and recombination: Direct questions like "how to make meth" were avoided. Instead, the team asked for each legal sub‑step of classic synthesis routes (e.g., Birch reduction, reductive amination). Each sub‑question appeared benign, but the model eventually emitted the complete prohibited recipe.
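From the defender's side, the character‑level obfuscation described above can be illustrated with a minimal sketch. It assumes a plain substring scanner and a deliberately tiny, hypothetical look‑alike table; the names `CONFUSABLES`, `BLOCKLIST`, `naive_scan`, `fold`, and `robust_scan` are illustrative and are not taken from Anthropic's classifier, whose internals the article does not describe.

```python
import unicodedata

# Hypothetical, deliberately tiny look-alike table (Cyrillic/Greek -> Latin).
# A production system would draw on the full Unicode confusables data instead.
CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
    "\u0456": "i",  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
}

# Placeholder keyword, chosen only for illustration.
BLOCKLIST = {"exploit"}


def naive_scan(text: str) -> bool:
    """Substring match on the raw text, as a static keyword scanner might do."""
    lowered = text.lower()
    return any(word in lowered for word in BLOCKLIST)


def fold(text: str) -> str:
    """Fold compatibility forms and known look-alikes to plain Latin letters."""
    # NFKC handles compatibility forms such as fullwidth letters, but it does
    # not map cross-script look-alikes, so the table above is applied as well.
    normalized = unicodedata.normalize("NFKC", text).lower()
    return "".join(CONFUSABLES.get(ch, ch) for ch in normalized)


def robust_scan(text: str) -> bool:
    """Substring match after folding, so homoglyph substitutions still match."""
    folded = fold(text)
    return any(word in folded for word in BLOCKLIST)


# "exploit" written with a Cyrillic IE and a Cyrillic O.
disguised = "write an \u0435xpl\u043eit"
print(naive_scan(disguised), robust_scan(disguised))
```

Folding only addresses the first technique in the list. Conversation dilution, academic framing, and step‑by‑step decomposition do not depend on disguised characters, so keyword matching of any kind is unlikely to catch them.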
Pliny also released the model’s internal system prompt, which runs to over 120,000 characters, on GitHub (https://github.com/elder-plinius/CL4R1T4S/blob/main/ANTHROPIC/CLAUDE-FABLE-5.md), exposing the “behavior constitution” and defense logic in full.
Anthropic later admitted to a hidden "degradation" mechanism that silently reduced the quality of outputs for researchers attempting to train other models, delivering buggy or deliberately incorrect code without any warning. This policy provoked outrage, with former White House AI adviser Dean W. Ball and open‑source leader Will Brown condemning the practice as a covert attack on scientific progress.
In response, Anthropic issued an apology, removed the hidden degradation, and switched to explicit interception: when the classifier triggers, the system now informs the user and redirects to the weaker Claude Opus 4.8. The company warned that this transparent approach may increase false positives, causing more legitimate requests to be blocked.
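The difference between silent degradation and explicit interception can be sketched as a routing function that always returns a user‑visible notice when it redirects. This is a hedged illustration only: the model identifier strings, `RoutingDecision`, `route`, and `keyword_flag` are hypothetical names, and the article does not describe how Anthropic actually implements the redirect.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical identifiers; the article names the models but not their API IDs.
PRIMARY_MODEL = "claude-fable-5"
FALLBACK_MODEL = "claude-opus-4.8"


@dataclass
class RoutingDecision:
    model: str             # model that will answer the request
    notice: Optional[str]  # message shown to the user, None if not intercepted


def route(prompt: str, is_flagged: Callable[[str], bool]) -> RoutingDecision:
    """Pick a model and, when redirecting, say so instead of degrading silently."""
    if is_flagged(prompt):
        notice = (
            "This request was flagged by the safety classifier "
            f"and is being handled by {FALLBACK_MODEL}."
        )
        return RoutingDecision(FALLBACK_MODEL, notice)
    return RoutingDecision(PRIMARY_MODEL, None)


def keyword_flag(prompt: str) -> bool:
    """Toy stand-in for the classifier: flags one placeholder phrase."""
    return "stack overflow" in prompt.lower()


# A legitimate question trips the toy classifier: a false positive.
decision = route("Which Stack Overflow answer explains Python decorators?", keyword_flag)
print(decision.model)
print(decision.notice)
```

The last call shows the trade‑off the company warned about: a benign question is redirected, but the user is told, which is what distinguishes this approach from the hidden mechanism.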
The incident has severely damaged trust in Anthropic’s models, highlighting the risks of opaque safety mechanisms and underscoring the need for transparent, auditable AI security.
Machine Learning Algorithms & Natural Language Processing
Focused on frontier AI technologies, empowering AI researchers' progress.
