Why AI Can't Keep Secrets and How Output Filtering Provides a Bulletproof Defense
Developers often hide credentials in system prompts, but a massive stress test by Swept AI and the University of Michigan shows that given enough time, large language models inevitably reveal those secrets, and only strict output‑filtering defenses consistently prevent leakage.
Unhideable Secrets
Developers often embed API keys, tokens, and other credentials in LLM system prompts, assuming that because users cannot see the prompts the secrets remain safe.
Large‑scale pressure testing by Swept AI and the University of Michigan showed that, given enough time and iterative attempts, an LLM will eventually reveal any hidden secret. The model cannot reliably distinguish developer‑written safety instructions from malicious user prompts; a carefully crafted adversarial prompt can coax the model into disclosing the secret.
Real‑world extractions have already occurred. In 2023 the full system prompt of Bing Chat was extracted, exposing internal codes and behavior rules, and the same happened to Snapchat’s AI assistant later that year. In 2026 Moltbook’s AI platform leaked 1.5 million API tokens, including plaintext OpenAI keys.
Typical defensive evaluations rely on a static list of attack vectors, but real attackers adapt their prompts based on model responses.
Smart Evolvers
The red‑team agent used in the experiment follows a natural‑selection loop: each round generates ten attack prompts, sends them to the target model, scores each response, retains the high‑scoring prompts (slightly rewritten) for the next round, and discards the low‑scoring ones. When a bottleneck is reached, the system automatically explores entirely new strategies.
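The selection loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the researchers' actual agent: the function name evolve_attacks, the parameters keep and patience, and the callables target, score, mutate, and explore are all hypothetical stand-ins for the model call, the leak scorer, the rewriting step, and the new-strategy generator.

```python
import random
from typing import Callable, List, Tuple


def evolve_attacks(
    seeds: List[str],
    target: Callable[[str], str],          # hypothetical: sends a prompt, returns the reply
    score: Callable[[str], float],         # hypothetical: 0.0 (no leak) .. 1.0 (full leak)
    mutate: Callable[[str, random.Random], str],
    explore: Callable[[random.Random], str],
    rounds: int = 25,
    population: int = 10,                  # ten prompts per round, as in the article
    keep: int = 3,                         # assumed number of survivors per round
    patience: int = 3,                     # assumed rounds without progress = bottleneck
    seed: int = 0,
) -> Tuple[str, float]:
    rng = random.Random(seed)
    prompts = list(seeds)[:population]
    while len(prompts) < population:
        prompts.append(explore(rng))

    best_prompt, best_score, stalled = "", -1.0, 0
    for _ in range(rounds):
        # Send every prompt to the target and rank the replies by score.
        scored = sorted(((score(target(p)), p) for p in prompts), reverse=True)
        top_score, top_prompt = scored[0]
        if top_score > best_score:
            best_prompt, best_score, stalled = top_prompt, top_score, 0
        else:
            stalled += 1
        if best_score >= 1.0:
            break  # full extraction, nothing left to improve

        survivors = [p for _, p in scored[:keep]]
        if stalled >= patience:
            # Bottleneck: keep the single best prompt, try entirely new strategies.
            prompts = survivors[:1] + [explore(rng) for _ in range(population - 1)]
            stalled = 0
        else:
            # Normal round: carry forward slightly rewritten high scorers.
            prompts = [mutate(rng.choice(survivors), rng) for _ in range(population)]
    return best_prompt, best_score
```

The point of the sketch is the feedback path: each round's prompts depend on how the target answered the previous round, which is what a static list of attack vectors lacks.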
Early attempts produced only trivial command fragments with a score of 0.10. Rephrasing the request to ask the model to write a data‑masking script that must use the original secret data raised the score to 0.95 and extracted the secret. The most effective breakthrough used role‑playing: the attacker told the model that it was acting as a security engineer and asked it to compare the original and malicious prompts, achieving a perfect score of 1.00 and bypassing all internal defenses.
Each attack round operates on three dimensions: mutation, pattern learning, and novel‑strategy exploration. When direct requests for the system prompt are blocked, the attacker pivots to asking for “prompt instructions for compliance review,” which leaks partial data, and then finishes with the role‑play trick.
All‑Line Collapse
Researchers evaluated nine mainstream defensive configurations, including adding strict warnings to prompts, wrapping user input in XML tags, prioritizing system instructions, and placing security constraints at both ends of the prompt (prompt‑sandwiching), as well as input filtering and output filtering.
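For readers unfamiliar with two of these prompt-level defenses, here is a rough sketch of XML wrapping combined with prompt‑sandwiching. The function name build_sandwiched_prompt and the user_input tag are illustrative assumptions, not the configuration used in the study. Both techniques still rely on the model choosing to obey the rules.

```python
from xml.sax.saxutils import escape


def build_sandwiched_prompt(rules: str, task: str, user_input: str) -> str:
    """Wrap untrusted input in a tag and repeat the rules before and after it."""
    # Escape <, > and & so the user cannot close the tag and inject text outside it.
    wrapped = "<user_input>" + escape(user_input) + "</user_input>"
    # Sandwich: the same security rules appear at both ends of the prompt.
    return "\n\n".join([rules, task, wrapped, rules])
```

Escaping keeps the tag boundary intact, but nothing here stops the model from following instructions that appear inside the tag.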
In a short‑term test of 25 rounds, most defenses showed only minor leaks, and output‑filtering schemes reported zero leakage. However, when the test was extended to 500 rounds (over 20 000 automated attacks), all defenses that relied on the model to enforce security rules collapsed within 300 rounds.
The failure pattern was consistent: input filtering and security‑instruction tags fell first, followed by tag isolation. Instruction prioritization and prompt‑sandwiching lasted longer but ultimately succumbed.
To confirm the issue was not model‑specific, top‑tier models were subjected to extreme pressure with default safety settings. Gemini 2.5 Pro and GPT‑5.4 disclosed all secrets in fewer than ten rounds; Claude Sonnet 4.6 resisted longer, delaying full disclosure until round 300, yet still failed.
The Sole Survivor
Across more than 15 000 attacks in three 500‑round campaigns, only one defense achieved zero leakage: output filtering.
Output filtering works because it runs outside the LLM. A separate piece of rigid code checks the model’s response against a known list of secret strings and cuts off any output containing a secret. Input filtering, which relies on regular expressions and keyword lists, cannot keep up with the myriad ways attackers can phrase malicious commands.
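A minimal sketch of such an output filter follows, assuming the application already holds the list of secrets it must protect. The names filter_output and _normalize are hypothetical, and the separator-stripping step is an added assumption beyond the plain string check described above. It would not catch encoded or translated forms of a secret.

```python
import re
from typing import Iterable

# Characters an attacker might ask the model to insert between secret characters.
_SEPARATORS = re.compile(r"[\s\-_.]+")


def _normalize(text: str) -> str:
    """Drop whitespace and common separators, ignore case."""
    return _SEPARATORS.sub("", text).casefold()


def filter_output(
    response: str,
    secrets: Iterable[str],
    replacement: str = "[response withheld]",
) -> str:
    """Return the response unchanged unless it contains a known secret."""
    normalized_response = _normalize(response)
    for secret in secrets:
        if not secret:
            continue
        normalized_secret = _normalize(secret)
        exact_hit = secret in response
        spaced_hit = bool(normalized_secret) and normalized_secret in normalized_response
        if exact_hit or spaced_hit:
            # Cut off the whole reply rather than trying to redact in place.
            return replacement
    return response
```

Because this check runs in ordinary application code after the model has produced its reply, no prompt can talk it out of the rule.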
The fundamental lesson is that security boundaries must be enforced by systems independent of the LLM. Hard‑coded rules in the application layer or external AI‑audit tools are the only reliable gatekeepers, and secrets should never be placed in system prompts.
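One way to keep a credential out of the prompt entirely is to let the model request an action and have application code attach the secret. The sketch below is an assumption-laden illustration: the LOOKUP convention, the ORDERS_API_KEY variable, and the lookup callable are all hypothetical.

```python
import os
from typing import Callable, Mapping

# The prompt describes the action but contains no credential.
SYSTEM_PROMPT = (
    "You are a support assistant. "
    "To look up an order, reply with: LOOKUP <order_id>"
)


def handle_model_reply(
    reply: str,
    lookup: Callable[[str, str], str],     # hypothetical backend call: (order_id, api_key)
    env: Mapping[str, str] = os.environ,
) -> str:
    """Run the requested action in application code; the model never sees the key."""
    prefix = "LOOKUP "
    if reply.startswith(prefix):
        order_id = reply[len(prefix):].strip()
        api_key = env.get("ORDERS_API_KEY", "")
        if not api_key:
            return "Order lookup is unavailable."
        return lookup(order_id, api_key)
    return reply
```

With this split, an extracted system prompt reveals the LOOKUP convention but no key, because the key only ever exists in the application's environment.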
Reference materials:
https://arxiv.org/pdf/2604.23887v1
https://www.swept.ai/
SuanNi
A community for AI developers that aggregates large-model development services, models, and compute power.
