Inputs Can Be Reconstructed from Transformer Hidden States with 100% Accuracy – New Invertibility Study
A recent paper from Sapienza University's GLADIA Lab shows that mainstream Transformer language models are injective and introduces SIPIT, an algorithm that recovers the original text from hidden states with perfect accuracy; extensive experiments confirm that the models retain all of the input information.
The paper "Language Models are Injective and Hence Invertible" (GLADIA Research Lab, Sapienza University of Rome) claims that popular Transformer language models preserve input information without loss, making them mathematically invertible.
The authors first validated injectivity by feeding six representative models—GPT-2, Gemma-3, LLaMA-3.1, Mistral, Phi-4-mini, and TinyStories—with over 100,000 samples drawn from Wikipedia, C4, The Pile, and GitHub Python code. For each sample they extracted the final token's hidden state at every layer and computed Euclidean distances between all pairs. Any distance below 10⁻⁶ would indicate a collision. After more than five billion pairwise comparisons, the smallest distance remained far above the threshold, and even a stress test with over three trillion combinations of the ten most semantically similar inputs produced no collisions, confirming practical injectivity.
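For readers who want to see what the collision check looks like in practice, here is a minimal sketch, assuming PyTorch and the Hugging Face transformers library; the model choice (GPT-2), the prompts, the layer index, and the exact threshold handling are illustrative placeholders rather than the authors' code.

```python
# Minimal sketch of the pairwise collision check (illustrative, not the paper's code).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True).eval()

prompts = [
    "The cat sat on the mat.",
    "The cat sat on the rug.",
    "def add(a, b): return a + b",
]
layer = 6          # any layer can be checked; the study repeats this for every layer
threshold = 1e-6   # distances below this would count as a collision

states = []
with torch.no_grad():
    for text in prompts:
        inputs = tokenizer(text, return_tensors="pt")
        hidden = model(**inputs).hidden_states[layer]  # shape (1, seq_len, d_model)
        states.append(hidden[0, -1])                   # final token's hidden state
states = torch.stack(states)

# Pairwise Euclidean distances; the diagonal (self-comparisons) is ignored.
dists = torch.cdist(states, states)
dists.fill_diagonal_(float("inf"))
print(f"smallest pairwise distance: {dists.min().item():.3e}")
print("collision detected" if dists.min() < threshold else "no collisions found")
```

Scaling this toy version up to the paper's setting amounts to running the same comparison over hundreds of thousands of samples, every layer, and all six models.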
To test reversibility, the team introduced the SIPIT (Sequential Inverse Prompt via Iterative Updates) algorithm. Leveraging the causal structure of Transformers—where the hidden state at position t depends only on tokens 1…t—SIPIT iteratively reconstructs the input sequence from hidden states alone. Experiments showed SIPIT recovers the exact original text for both natural language and code data with 100% accuracy, operating in linear time and running significantly faster than brute‑force enumeration.
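To make the sequential idea concrete, here is a deliberately naive sketch that recovers tokens one position at a time by scanning the vocabulary for the token whose hidden state matches the target. It is a hypothetical simplification for intuition, not the authors' SIPIT implementation; the function name, tolerance value, and brute-force inner loop are assumptions made for the example.

```python
# Naive sequential-recovery sketch in the spirit of SIPIT (illustrative only).
import torch

@torch.no_grad()
def recover_tokens(model, target_hidden, layer, vocab_size, tol=1e-4):
    """Recover input ids from `target_hidden`, the (seq_len, d_model) hidden
    states of the true input at `layer`. Position t depends only on tokens 1..t,
    so each token can be identified before moving on to the next position."""
    recovered = []
    for t in range(target_hidden.size(0)):
        best_tok, best_dist = None, float("inf")
        for tok in range(vocab_size):               # scan candidate tokens
            ids = torch.tensor([recovered + [tok]])
            hidden = model(ids, output_hidden_states=True).hidden_states[layer]
            dist = torch.norm(hidden[0, t] - target_hidden[t]).item()
            if dist < best_dist:
                best_tok, best_dist = tok, dist
            if dist < tol:                          # exact match: stop scanning
                break
        recovered.append(best_tok)
    return recovered

# Example usage (assumes `model` and `tokenizer` as in the previous snippet):
# target = model(tokenizer("hello world", return_tensors="pt").input_ids,
#                output_hidden_states=True).hidden_states[6][0]
# print(recover_tokens(model, target, layer=6, vocab_size=tokenizer.vocab_size))
```

Each position costs at most one forward pass per vocabulary token, so the work grows linearly with sequence length rather than exponentially; the paper's SIPIT is reported to reach the same result in linear time while running significantly faster than brute-force enumeration of whole sequences.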
The authors also provided a theoretical analysis of training dynamics, proving that both gradient descent and stochastic gradient descent constitute continuous, invertible transformations. Consequently, the models retain the injective property throughout training, from random initialization to convergence.
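One way to unpack that claim, offered here as a paraphrase rather than the paper's exact wording: a single training step acts on the parameters as the map

```latex
\[
  \theta_{k+1} \;=\; \Phi_\eta(\theta_k) \;=\; \theta_k \;-\; \eta\,\nabla_\theta \mathcal{L}(\theta_k),
\]
```

assuming a smooth loss and step size eta. If every such update map is continuous and invertible, it sends negligible (measure-zero) sets of parameters to negligible sets, so the rare parameter configurations that would produce a collision stay negligible after any finite number of gradient or stochastic-gradient steps.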
Beyond the technical contribution, the study highlights privacy implications: because hidden states can be used to fully reconstruct inputs, storing or transmitting these activations may expose user data. The authors advise careful handling of internal activations and suggest that model compression and distillation pipelines take this potential leakage into account.
While the results have sparked debate—some researchers argue that numerical approximations, quantization, and stochasticity in large‑scale models could break strict injectivity—the GLADIA team emphasizes that their goal is theoretical insight rather than providing a practical attack vector.
