What Do Your Logits Know? Surprising Insights from Apple’s New AI Paper
Apple’s recent AI paper probes whether large vision‑language models truly forget user data by examining residual streams and final logits, revealing that hidden image attributes persist in top‑k outputs and exposing significant privacy and security risks.
Apple’s AI research team recently released the paper “What do your logits know? (The answer may surprise you!)” (arXiv:2604.09885), which investigates whether large vision‑language models truly forget information after processing.
Information Bottleneck Principle
The authors introduce the Information Bottleneck Principle, illustrating it with a CEO deciding on an acquisition: only decision‑relevant data should survive compression, while irrelevant details are discarded. The same idea applies to vision‑language models, where image features irrelevant to the question should be filtered out before the final answer.
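The principle has a standard formal statement (due to Tishby et al.; the paper's own notation may differ): find a compressed representation T of the input X that minimizes retained input information while preserving information about the decision target Y:

```latex
\min_{p(t \mid x)} \; I(X; T) \;-\; \beta \, I(T; Y)
```

Here I(·;·) is mutual information and β sets the trade‑off. A model that obeys the bottleneck should drive I(X;T) down to only what the question requires, which is exactly what the paper's probes test.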
Experiment Design
Two lightweight probe networks are attached to specific points in the model: the Residual Stream, which carries all intermediate hidden states, and the final Logits, the raw (pre‑softmax) scores over the vocabulary from which the next token is chosen. Experiments use the synthetic CLEVR dataset and the real‑world MSCOCO dataset, with perturbations such as Gaussian noise, glass blur, and motion blur applied to the images.
Probes are trained to infer image attributes (noise level, object color, background objects) from the selected layers after the model answers a simple visual question.
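The probing setup can be sketched as a small linear classifier trained on frozen activations. Everything below is synthetic and illustrative (not the paper's actual data or architecture): the point is only to show what "a probe recovers a hidden attribute from activations" means in code.

```python
import numpy as np

# Toy version of the probing setup: a linear probe is trained on frozen
# activations to recover a hidden image attribute the model wasn't asked about.
rng = np.random.default_rng(0)
n, d, n_classes = 200, 64, 4  # samples, activation width, attribute classes

# Hidden attribute, e.g. perturbation type:
# 0 = clean, 1 = Gaussian noise, 2 = glass blur, 3 = motion blur.
labels = rng.integers(0, n_classes, n)

# Stand-in for residual-stream activations that encode the attribute.
centers = rng.normal(size=(n_classes, d))
acts = centers[labels] + 0.5 * rng.normal(size=(n, d))

# Linear probe: a single softmax layer fit by full-batch gradient descent.
W, b = np.zeros((d, n_classes)), np.zeros(n_classes)
onehot = np.eye(n_classes)[labels]
for _ in range(300):
    scores = acts @ W + b
    p = np.exp(scores - scores.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    grad = (p - onehot) / n          # dL/dscores for cross-entropy
    W -= acts.T @ grad
    b -= grad.sum(axis=0)

acc = (np.argmax(acts @ W + b, axis=1) == labels).mean()
print(f"probe accuracy: {acc:.2f}")  # high accuracy = attribute still decodable
```

High probe accuracy on an attribute the model never had to report is exactly the paper's signature of "no compression": the information survived in the representation.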
Seven Findings
1. Residual Stream as an Oracle
The residual stream retains almost all image details, allowing probes to recover noise type, object shape, color, and even unrelated background attributes with near‑perfect accuracy, indicating no effective compression at this stage.
2. Low‑dimensional Projections Still Leak Secrets
Using Tuned Lens to map residual stream trajectories to Logit space, probes can extract core decision information and background features from the top‑2 trajectories, showing that information bottleneck filtering does not occur.
3. Final Logits Encode Decision and Target Information
At the last layer, some compression happens but is insufficient; probes can accurately predict image noise level and type from the top‑2 logits.
4. Unasked Attributes Appear in Top‑k Logits
Even when a prompt omits certain object properties (e.g., material or size), probes can infer these from the top‑0.5L logits, revealing that the model carries redundant target features to the output.
5. Logits Record Environmental Context
Beyond the target object, increasing the number of examined logits allows accurate prediction of background object count, color, and other scene attributes, exposing hidden environmental data.
6. Leakage Peaks at ~60 Logits (Inverted‑U Curve)
Accuracy rises sharply when observing roughly 30–80 logits, then drops as additional logits contribute mostly high‑dimensional noise, indicating that a small head of the output distribution is sufficient for privacy leakage.
7. Top‑k Logits Match Deep‑Layer Risks
When the observation dimension is held constant, extracting information from top‑k logits (often exposed via public APIs) is as effective as accessing deep internal states, challenging the belief that gray‑box API access is inherently safe.
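The gray‑box threat model in findings 4–7 can be sketched end to end. An attacker who sees only top‑k probabilities from a (hypothetical) VQA API aligns responses by token id, builds a fixed‑length feature vector, and fits a probe on it. Everything here is simulated; the vocabulary size, the attribute, and the small per‑class logit shift are illustrative assumptions, not the paper's numbers.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab, k, n, n_classes = 1000, 60, 300, 3

# Hidden scene attribute (e.g. background-object count bucket) that
# subtly shifts the model's output distribution over the vocabulary.
labels = rng.integers(0, n_classes, n)
base = rng.normal(size=vocab)                      # shared "model" logits
shift = 0.3 * rng.normal(size=(n_classes, vocab))  # per-class signal
full_logits = base + shift[labels] + 0.1 * rng.normal(size=(n, vocab))

probs = np.exp(full_logits - full_logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)

# The attacker keeps a fixed reference set of token ids (here: the k
# tokens with the highest average probability) so features line up
# across queries, exactly as a top-k API response would allow.
ref = np.argsort(probs.mean(axis=0))[-k:]
X = probs[:, ref]  # n x k features built purely from API-visible output

# Minimal probe: nearest class centroid in top-k probability space.
cents = np.stack([X[labels == c].mean(axis=0) for c in range(n_classes)])
pred = ((X[:, None, :] - cents[None]) ** 2).sum(axis=2).argmin(axis=1)
acc = (pred == labels).mean()
print(f"top-{k} probe accuracy: {acc:.2f} (chance {1 / n_classes:.2f})")
```

Accuracy well above chance from nothing but top‑k probabilities is the paper's point: the API surface, not just the internals, carries the hidden attribute.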
Privacy and Security Implications
The findings highlight a serious privacy risk: even a simple visual‑question‑answer API that returns only a short answer and top‑k probabilities can inadvertently expose detailed background and personal information contained in the uploaded image. Malicious actors could reconstruct private attributes from these probability scores, and the residual hidden information also contributes to hallucinations in generated text.
Conclusion
The paper warns that the seemingly harmless top‑k logits of large models can act as a “recording device” for user data, hanging like a sword of Damocles over generative AI deployments, and urges stronger safeguards in privacy‑preserving model design.
