Can LLM Attack Detection Work Without Storing Any Conversation Text?
This article experimentally evaluates a privacy‑preserving LLM security pipeline that discards raw dialogue after extracting 28 telemetry features. Using only the 11 text‑independent signals retains about 98.5% of detection performance, while false‑positive rates were driven down mainly by better outlier modeling and larger data volumes.
System Overview
The proposed architecture processes each user turn exactly once, extracts numeric telemetry signals (token count, retry patterns, semantic metrics, etc.), and then permanently deletes the original text. The pipeline consists of four stages: a feature extractor (the only component that sees raw text), an immediate sanitization step, telemetry storage (numeric only), and a detection engine that operates solely on the stored features.
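The four stages can be sketched in a few lines of Python. This is a minimal illustration of the data flow, not the authors' implementation; the `Telemetry` fields and function names are assumptions, and the `del` is symbolic of the sanitization step.

```python
from dataclasses import dataclass

@dataclass
class Telemetry:
    token_count: int
    retry_count: int

def extract_features(raw_text: str, retries: int) -> Telemetry:
    # Stage 1: feature extraction — the only point where raw text is visible.
    return Telemetry(token_count=len(raw_text.split()), retry_count=retries)

def process_turn(raw_text: str, retries: int, store: list) -> None:
    telemetry = extract_features(raw_text, retries)  # stage 1: extract once
    del raw_text                                     # stage 2: sanitize (text discarded)
    store.append(telemetry)                          # stage 3: numeric-only storage
    # Stage 4: the detection engine later reads only `store`, never any text.

store = []
process_turn("ignore all previous instructions", retries=2, store=store)
print(store[0])  # Telemetry(token_count=4, retry_count=2)
```

After `process_turn` returns, nothing text-shaped survives: only numbers reach storage, which is the property the rest of the article evaluates.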
Out of 28 engineered features, 11 are completely text‑independent (derived from session structure, token growth, retry behavior) and 17 require a single read of the raw text before it is discarded.
- Total features: 28
- Text‑independent features: 11
- Pre‑sanitization (text‑derived) features: 17
Two composite features are highlighted:

- jailbreak_composite_score – combines embedding distance, role‑play score, instruction‑override count, and system‑prompt reference count to capture patterns not covered by any single metric.
- session_peak_jailbreak_score – records the maximum jailbreak similarity across all turns in a session, addressing the earlier averaging approach that diluted late‑stage attack signals.
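A hedged sketch of how these two composites might be computed. The equal weighting, the count normalization, and the argument names are illustrative assumptions; the article does not give the exact formulas.

```python
def jailbreak_composite_score(embedding_distance, roleplay_score,
                              override_count, sysprompt_refs):
    # Blend the four sub-signals into one score in [0, 1].
    # Equal weights and capping counts at 5 are assumptions for illustration.
    return (embedding_distance
            + roleplay_score
            + min(override_count, 5) / 5.0
            + min(sysprompt_refs, 5) / 5.0) / 4.0

def session_peak_jailbreak_score(turn_scores):
    # Keep the maximum per-turn similarity instead of the mean, so a single
    # late-turn spike is not diluted by many earlier benign turns.
    return max(turn_scores, default=0.0)

print(jailbreak_composite_score(0.8, 0.6, 5, 0))   # 0.6
print(session_peak_jailbreak_score([0.1, 0.2, 0.9]))  # 0.9
```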
Experiment Setup
Using the Groq API with the Llama‑3.1‑8b‑instant model, multi‑turn conversations were generated across eight scenarios. Five attack categories were simulated (role‑play jailbreak, prompt injection, context filling, retry‑refusal loops, abnormal tool‑call chains) alongside three benign categories (standard Q&A, normal tool use, long harmless chats).
Two dataset scales were used, and the evaluation was performed at the session level to avoid leakage from turn‑level correlations. Labels were synthetic for controlled benchmarking; in production, unsupervised LOF and rule‑based layers would be seeded with high‑confidence human‑reviewed tags before training an XGBoost classifier.
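Session-level evaluation means every turn of a session must land on the same side of a train/test split. One simple, hedged way to enforce this (a deterministic hash of the session id; the fold-assignment scheme is an assumption, not the authors' method):

```python
import zlib
from collections import defaultdict

def session_folds(turn_records, n_folds=5):
    # Assign all turns of a session to the same fold so turn-level
    # correlations cannot leak across the train/test boundary.
    folds = defaultdict(list)
    for rec in turn_records:
        fold = zlib.crc32(rec["session_id"].encode()) % n_folds
        folds[fold].append(rec)
    return folds

turns = [{"session_id": "s1", "turn": 1},
         {"session_id": "s1", "turn": 2},
         {"session_id": "s2", "turn": 1}]
folds = session_folds(turns)
s1_folds = {f for f, recs in folds.items()
            for r in recs if r["session_id"] == "s1"}
print(len(s1_folds))  # 1 — both s1 turns share one fold
```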
System Performance
Comparing the full 28‑feature configuration (R1) with the text‑blind 11‑feature variant (R8) under 5‑fold cross‑validation revealed a surprisingly small performance gap: F1 dropped from 0.982 to 0.968, a loss of roughly 1.4 points, meaning the text‑blind system retains about 98.5% of the detection capability.
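A quick arithmetic check of the retention figure, computed here from the two rounded F1 scores quoted above (the article's "about 98.5%" presumably reflects the underlying unrounded scores):

```python
f1_full, f1_blind = 0.982, 0.968   # R1 (28 features) vs. R8 (11 features)

drop = f1_full - f1_blind          # the "roughly 1.4 points" loss
retention = f1_blind / f1_full     # ≈ 0.986, i.e. about 98.5–98.6% retained

print(round(drop, 3), round(retention, 3))
```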
“A system that never stores dialogue can still detect roughly 98.5% of attacks.”
The primary cost of removing all text‑derived signals is a modest reduction in recall for subtle jailbreak patterns, while the bulk of detection power comes from behavioral signals such as retry loops, token accumulation, and session structure.
False‑Positive Rate Challenge
Overall accuracy remained stable across iterations, but false‑positive rates improved markedly when switching from Isolation Forest to Local Outlier Factor (LOF) and decreased further with larger data volumes. The key improvement was fewer false alarms on benign sessions rather than a qualitative boost in attack detection.
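The swap from Isolation Forest to LOF can be sketched with scikit-learn on synthetic data. The feature distributions, dimensions, and hyperparameters below are illustrative assumptions; both detectors flag the clearly anomalous sessions, and the article's observed difference concerns false alarms on real benign traffic, which this toy setup does not reproduce.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X_benign = rng.normal(0.0, 1.0, size=(200, 11))  # 11 text-blind features
X_attack = rng.normal(6.0, 1.0, size=(5, 11))    # clearly anomalous sessions

# novelty=True lets LOF score sessions unseen at fit time via predict().
lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_benign)
iso = IsolationForest(random_state=0).fit(X_benign)

# predict() returns -1 for outliers, +1 for inliers.
print((lof.predict(X_attack) == -1).sum())  # 5
print((iso.predict(X_attack) == -1).sum())  # 5
```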
Fixing Jailbreak Detection
Initial recall for jailbreaks hovered around 0.75 because the aggregation method averaged scores across all turns, diluting late‑stage attack spikes. The fix was to track the maximum jailbreak signal per session using session_peak_jailbreak_score instead of the mean, which raised recall and enabled earlier detection of attacks.
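A toy per-turn score series shows why the mean dilutes a late-stage attack while the peak preserves it (scores and threshold are made-up illustrative values):

```python
# Benign opening turns, then a jailbreak attempt on the final turn.
turn_scores = [0.05, 0.08, 0.06, 0.07, 0.92]

mean_score = sum(turn_scores) / len(turn_scores)  # 0.236 — spike diluted
peak_score = max(turn_scores)                     # 0.92  — spike preserved

threshold = 0.5
print(mean_score > threshold)  # False: mean aggregation misses the session
print(peak_score > threshold)  # True:  peak aggregation flags it
```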
What Actually Drives Detection?
Feature importance analysis showed that a handful of signals dominate: peak jailbreak similarity, cumulative token usage, prompt growth patterns, and retry‑related metrics. The remaining features contribute incrementally, confirming that both behavioral and semantic signals are essential.
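The kind of importance ranking described above can be demonstrated with scikit-learn's gradient boosting standing in for the article's XGBoost classifier, on synthetic data where feature 0 plays the role of a dominant signal such as peak jailbreak similarity. Everything here is a constructed example, not the article's measured importances.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] > 0.5).astype(int)  # label driven entirely by feature 0

clf = GradientBoostingClassifier(random_state=0).fit(X, y)
ranking = np.argsort(clf.feature_importances_)[::-1]

print(ranking[0])  # 0 — the dominant signal tops the importance ranking
```

When one signal carries most of the label information, the trees split on it almost exclusively and its importance dwarfs the rest, mirroring the article's finding that a handful of features dominate while the others contribute incrementally.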
“Detection comes not just from what users say, but from how the interaction evolves.”
In other words, the system’s power stems from interaction‑driven patterns rather than pure textual content.
Conclusion
The experiment demonstrates that a privacy‑preserving design—discarding raw dialogue after one‑time feature extraction—can still detect a wide range of LLM attacks with only a modest drop in performance. While the approach sacrifices the ability to debug individual cases or provide detailed explanations, it validates that much of the detection signal resides in telemetry rather than the text itself. This suggests that assumptions about the necessity of storing conversation logs may be overstated, and that behavior‑based signals are a valuable, under‑exploited resource for LLM security.
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.