Combining Transformers and RNNs: Google’s Memory Caching Unlocks Ultra‑Long Context

Google Research introduces Memory Caching (MC), a technique that gives RNNs growing memory capacity. MC bridges the gap with Transformers, enabling ultra-long-context processing while reducing memory demands; the authors demonstrate its effectiveness through extensive language-modeling and recall experiments.

By Machine Heart

Google Research recently published a paper titled Memory Caching: RNNs with Growing Memory (arXiv:2602.24281) that proposes a new architecture‑level solution to the memory bottleneck of large‑scale Transformers when handling very long texts.

Transformers achieve strong long-context recall because they cache every token, but attention compute grows quadratically with sequence length and the key–value cache grows linearly, making long contexts expensive in both compute and memory. In contrast, recurrent models such as RNNs, linear-attention models, and state-space models have fixed-size hidden states with linear cost, yet they must compress all past information into a single hidden vector, causing severe information loss on recall-intensive tasks.
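A rough back-of-envelope comparison makes the trade-off concrete; the notation below is ours, not the paper's. At decoding step t, with full sequence length T and hidden width d (projection costs common to both architectures omitted):

```latex
% Dominant per-token decoding costs (our notation, not the paper's).
% Transformer: attends over a key--value cache holding t past tokens.
% Recurrent model: applies a fixed-size state update.
\begin{align*}
\text{Transformer:}\quad & \mathcal{O}(t\,d)\ \text{compute per token},\qquad \mathcal{O}(T\,d)\ \text{cache memory}\\
\text{RNN/SSM:}\quad & \mathcal{O}(d^{2})\ \text{compute per token},\qquad \mathcal{O}(d)\ \text{state memory}
\end{align*}
```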

The authors introduce Memory Caching (MC), which periodically snapshots hidden states of an RNN and stores them in a long-term cache. During inference the model can retrieve relevant cached snapshots in addition to the online hidden state, effectively giving the RNN a "growing memory" while keeping per-token decoding cost low.
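To make the mechanism concrete, here is a minimal NumPy sketch of that idea. The class name, snapshot interval, and attention-style read are our illustration under assumptions, not the paper's implementation:

```python
import numpy as np

# Minimal sketch of the Memory Caching idea: an RNN periodically snapshots
# its hidden state into a growing cache, and each step can read from those
# snapshots in addition to the online state. All details are illustrative.

class CachedRNN:
    def __init__(self, d_hidden, snapshot_every=64, seed=0):
        rng = np.random.default_rng(seed)
        self.W_h = rng.normal(0.0, 0.1, (d_hidden, d_hidden))  # recurrence weights
        self.W_x = rng.normal(0.0, 0.1, (d_hidden, d_hidden))  # input projection
        self.snapshot_every = snapshot_every
        self.cache = []              # long-term memory: hidden-state snapshots
        self.h = np.zeros(d_hidden)  # online hidden state

    def step(self, x, t):
        # Ordinary recurrent update (a plain tanh cell for concreteness).
        self.h = np.tanh(self.W_h @ self.h + self.W_x @ x)
        # Periodically snapshot the hidden state into the growing cache.
        if t % self.snapshot_every == 0:
            self.cache.append(self.h.copy())
        return self.read(self.h)

    def read(self, query):
        # Attention-like read over cached snapshots, combined with the
        # online state so the model sees both short- and long-term memory.
        if not self.cache:
            return query
        M = np.stack(self.cache)                  # (num_snapshots, d_hidden)
        scores = M @ query / np.sqrt(len(query))  # similarity to each snapshot
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return query + w @ M                      # residual combination

rnn = CachedRNN(d_hidden=64)
xs = np.random.default_rng(1).normal(size=(256, 64))
for t, x in enumerate(xs):
    out = rnn.step(x, t)
print(len(rnn.cache), out.shape)  # cache grew to 4 snapshots; output stays (64,)
```

Note that in this sketch the cache gains one entry every `snapshot_every` tokens rather than one per token, which is why such a memory grows far more slowly than a Transformer's key–value cache.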

Three concrete MC variants are explored (see the sketch after this list):

Gated Residual Memory: queries retrieve relevant past information, which is then pooled with an attention-like operation; decoding cost grows with effective memory size.

Memory Soup: aggregates the weights of past memory slots rather than query-specific outputs, also leading to growing cost.

Sparse Selective Caching (SSC): sparsely selects a subset of cached memories for each token, achieving a trade-off where effective memory grows but per-token cost remains roughly constant.
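The sketch below renders the retrieval reads of the first and third variants in NumPy; it is our interpretation of the descriptions above, not the paper's code. Gated Residual Memory pools over every cached slot, so its read cost grows with the cache, while Sparse Selective Caching reads only from a top-k subset, keeping the expensive step roughly constant per token:

```python
import numpy as np

# Illustrative reads over a cache of memory slots (our rendering, not the
# paper's code). Shapes: memory is (n_slots, d), query is (d,).

def gated_residual_read(query, memory):
    # Softmax-pooled read over *every* slot: cost grows with n_slots.
    scores = memory @ query / np.sqrt(len(query))
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return query + w @ memory

def sparse_selective_read(query, memory, k=4):
    # Score all slots, but pool over only the k most relevant ones.
    # (Scoring is exhaustive here for clarity; keeping the whole step
    # constant-cost would need an approximate nearest-neighbor index.)
    scores = memory @ query / np.sqrt(len(query))
    top = np.argpartition(scores, -k)[-k:]   # indices of the top-k slots
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()
    return query + w @ memory[top]           # read cost set by k, not n_slots

rng = np.random.default_rng(0)
memory = rng.normal(size=(1024, 64))  # a cache that has grown to 1024 slots
query = rng.normal(size=64)
print(gated_residual_read(query, memory).shape)    # (64,)
print(sparse_selective_read(query, memory).shape)  # (64,)
```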

Experiments on a 1.3‑billion‑parameter model cover language modeling, dense‑recall, long‑context, and needle‑in‑a‑haystack tasks. The results show that adding MC consistently improves performance, narrows the gap to Transformers on in‑context recall, and outperforms other recurrent baselines, although Transformers still hold the highest accuracy on the most demanding dense‑recall benchmarks.

Overall, the study demonstrates that a simple caching intuition can substantially enhance RNN‑based architectures, offering a viable path toward ultra‑long context processing without the prohibitive memory costs of pure Transformers.

Tags: Transformer, long context, AI architecture, RNN, Memory Caching, Google Research