Can Prompt Injection Be Detected Without Storing Conversation Logs? A Privacy‑First Experiment

The article presents a privacy‑first system that extracts numeric telemetry from each LLM interaction, discards raw text, and evaluates whether detection of prompt injection and jailbreak attacks remains effective, showing only a 1.4 F1‑point drop when using solely text‑independent features.

DeepHub IMBA
DeepHub IMBA
DeepHub IMBA
Can Prompt Injection Be Detected Without Storing Conversation Logs? A Privacy‑First Experiment

System Overview

Original conversation text is processed once for feature extraction and then permanently discarded. Each interaction passes through a feature extractor that computes token count, retry patterns, and several semantic metrics; the raw text is then destroyed. No downstream component can access the original content.

The pipeline has four parts: the feature extractor (the only component that sees raw text), an anonymization step that deletes the text, telemetry storage that keeps only numeric features, and a detection engine that operates purely on telemetry. After feature computation, the system retains no dialogue content.

Features are divided into two groups: 11 features that are completely text‑independent (derived from session structure, retry mode, token growth, etc.) and 17 features that are extracted before anonymization (semantic signals such as embedding distance, role‑play score, instruction‑override count, system‑prompt reference count). The two composite features jailbreak_composite_score and session_peak_jailbreak_score are described.

Experiment Setup

Eight scenarios were generated via the Groq API (Llama‑3.1‑8b‑instant) producing multi‑turn conversations. Five attack categories were included: role‑play jailbreak, prompt injection, context filling, retry‑refusal loops, and abnormal tool‑call chains. Three benign categories were standard Q&A, normal tool use, and long harmless sessions.

Two dataset scales were used, and a controlled benchmark with ground‑truth labels was created (labels are for evaluation only, not assumed available in production). The system was evaluated at the session level to avoid leakage across turns.

System Performance

Configuration R8 (which includes jailbreak fixes) improves on the R1 baseline (which emphasizes low false‑positive rate) when evaluated on 27 of the 28 features. The system always runs under the “no conversation log” constraint, but feature computation differs: some features are text‑independent, others are derived before text deletion.

A comparison between the full 28‑feature set and a reduced set containing only the 11 text‑independent features shows a surprisingly small gap: five‑fold cross‑validation yields an F1 drop from 0.982 to 0.968 (≈1.4 points). Thus a completely text‑blind system retains about 98.5 % of the detection performance.

Behavioral signals (retry patterns, token accumulation, session structure) carry most of the detection power; semantic signals add measurable but modest gains, especially for subtle attacks like jailbreaks.

False‑Positive Rate Challenge

Switching from Isolation Forest to Local Outlier Factor (LOF) and enlarging the dataset significantly reduced the false‑positive rate, while overall accuracy remained stable. The main improvement was fewer false alarms on normal conversations.

Fixing Jailbreak Detection

Recall for jailbreak detection plateaued around 0.75 because early turns appear benign and the jailbreak signal emerges later. The original approach averaged signals across the session, diluting late‑stage evidence. The fix tracks the maximum session_peak_jailbreak_score across all turns instead of the mean, raising recall and enabling earlier detection.

Driving Factors Behind Detection

Feature importance analysis shows that a few signals dominate: peak jailbreak similarity, cumulative token usage, prompt growth patterns, and retry‑related features. The remaining features contribute incrementally.

Conclusion

The experiment demonstrates that storing no conversation text still allows detection of many attack categories. Telemetry retains strong signals—especially behavioral patterns—while the cost is loss of detailed debugging and the assumption that attacker behavior differs from normal users, which may not always hold. Nonetheless, under strict privacy constraints, a telemetry‑only design is a viable option.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

privacyprompt injectiontelemetryjailbreak detectionLLM Securitybehavioral features
DeepHub IMBA
Written by

DeepHub IMBA

A must‑follow public account sharing practical AI insights. Follow now. internet + machine learning + big data + architecture = IMBA

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.