Why Anthropic and OpenAI Are Adding ‘Dreaming’ to Their LLMs – Google’s Explanation
Anthropic and OpenAI have both introduced a Dreaming mechanism for their language models. A recent Google paper argues that LLMs suffer from a form of anterograde amnesia and proposes a Sleep paradigm, combining memory consolidation with Dreaming, that the authors report improves continual learning, long-context handling, math reasoning, and efficiency across several benchmarks.
Anthropic and OpenAI Adopt Dreaming
Anthropic launched Dreaming at the Code with Claude conference. It lets Claude Managed Agents review their history after a session, and reportedly raised task-completion rates at the legal-tech firm Harvey by 6×. OpenAI released ChatGPT Dreaming V3 on June 4, integrating backend memory for all users and raising memory-recall success from 41.5% (manual) to 82.8%.
Core Insight: LLMs Have Anterograde Amnesia
Google’s paper argues that large language models can retain pre‑training knowledge (old memory) and learn temporarily via in‑context learning, but they forget newly acquired information as soon as a conversation ends—analogous to anterograde amnesia.
Sleep Paradigm – A “Sleep‑Dream” System for LLMs
The proposed Sleep paradigm mirrors human sleep, consisting of two stages:
Memory Consolidation (NREM‑like)
Distills high‑frequency short‑term updates from the Attention layer into low‑frequency long‑term updates in the MLP/FFN layer.
Uses parameter expansion to increase capacity and avoid catastrophic forgetting.
Applies Knowledge Seeding to perform upward distillation—small, frequently updated modules teach larger, stable modules.
Dreaming (REM‑like)
Generates synthetic “dream” data with reinforcement learning, allowing the model to rehearse new knowledge without supervision.
Introduces controlled noise via MoE router random expert selection to explore novel knowledge combinations.
Enforces a two‑stage isolation: first consolidate memory, then self‑improve, preventing direct overwriting of old knowledge.
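The router noise described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `route_top_k`, the `noise_prob` parameter, and the choice to replace only the lowest-ranked selected expert are all assumptions made for the example.

```python
import random

def route_top_k(router_logits, k, noise_prob=0.0, rng=None):
    """Pick k experts per token from per-token router logits.

    Hypothetical noise rule (an assumption, not from the paper): with
    probability noise_prob, the lowest-ranked chosen expert is swapped
    for a random expert that the router did not select.
    """
    rng = rng or random.Random()
    routes = []
    for logits in router_logits:
        # Rank expert indices by logit, highest first.
        ranked = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)
        chosen, outside = ranked[:k], ranked[k:]
        if outside and rng.random() < noise_prob:
            # Inject an "unrelated" expert in place of the weakest choice.
            chosen[-1] = rng.choice(outside)
        routes.append(chosen)
    return routes

logits = [[0.1, 2.0, 1.0, -1.0], [3.0, 0.0, 0.5, 2.5]]
print(route_top_k(logits, k=2))  # plain top-k routing
print(route_top_k(logits, k=2, noise_prob=1.0, rng=random.Random(0)))  # noisy
```

Keeping the top-ranked expert fixed and perturbing only the last slot is one way to keep the noise "controlled"; other schemes, such as adding noise to the logits themselves, would fit the same description.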
Technical Core 1: Knowledge Seeding
The paper’s most innovative contribution is upward distillation, which combines Generalized Knowledge Distillation (GKD) with Imitation Learning:
On-Policy Distillation: the student generates data; the teacher provides token-level feedback.
Learning to Imitate (LTI): reinforcement learning trains the student to mimic the teacher, rewarding both semantic similarity and a low token-level Levenshtein (edit) distance.
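To make the LTI reward concrete, here is a minimal sketch of a reward that mixes a semantic-similarity term with a token-level edit-distance term. Several parts are assumptions for illustration: the bag-of-words cosine is only a stand-in for whatever semantic scorer the paper uses, the distance is normalized into a similarity in [0, 1], and the mixing weight `alpha` is hypothetical.

```python
import math
from collections import Counter

def levenshtein(a, b):
    """Edit distance between two token sequences (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def edit_similarity(a, b):
    """Map edit distance to [0, 1]; 1.0 means identical sequences."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def bow_cosine(a, b):
    """Stand-in semantic score: cosine similarity of token-count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

def imitation_reward(student_tokens, teacher_tokens, alpha=0.5):
    """Hypothetical reward: weighted mix of semantic and edit similarity."""
    return (alpha * bow_cosine(student_tokens, teacher_tokens)
            + (1 - alpha) * edit_similarity(student_tokens, teacher_tokens))

teacher = "the model consolidates memory during sleep".split()
student = "the model consolidates memories during sleep".split()
print(round(imitation_reward(student, teacher), 3))
```

The two terms are complementary: the edit-distance term rewards matching the teacher's wording token by token, while the semantic term is meant to tolerate paraphrase (the bag-of-words stand-in here does not, which is why a real setup would use an embedding-based scorer).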
Technical Core 2: Dreaming Enhancements
Gradient‑Driven Data Selection : importance scores are computed for each generated dream; only the top‑k most valuable dreams are kept.
Random Expert Noise Injection : MoE routers randomly select unrelated experts during sampling, simulating REM‑stage exploratory connections.
Two‑Stage Isolation : memory consolidation precedes Dreaming, avoiding iterative self‑training that would overwrite existing knowledge.
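The top-k filtering step can be sketched as follows. The article does not say how the importance score is computed, so this example assumes one plausible choice: the dot product between each dream's gradient and a reference gradient (for instance, one computed on the consolidated memories). The names `select_dreams`, `dream_grads`, and `reference_grad` are hypothetical.

```python
import heapq

def select_dreams(dreams, dream_grads, reference_grad, k):
    """Keep the k dreams whose gradients align best with a reference gradient.

    Assumed scoring rule (not confirmed by the source): importance is the
    dot product between a dream's gradient and the reference gradient.
    """
    scores = [
        sum(g * r for g, r in zip(grad, reference_grad))
        for grad in dream_grads
    ]
    # Indices of the k highest-scoring dreams, best first.
    top = heapq.nlargest(k, range(len(dreams)), key=lambda i: scores[i])
    return [(dreams[i], scores[i]) for i in top]

dreams = ["dream A", "dream B", "dream C"]
grads = [[1.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]
print(select_dreams(dreams, grads, reference_grad=[1.0, 1.0], k=2))
```

In this toy example the third dream has a gradient pointing away from the reference direction, so it is discarded. In a real model the gradients would be far too large to store per example in full, and a projected or low-rank approximation would likely be needed.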
Effectiveness of Sleep
4.1 Continual Learning – Countering Catastrophic Forgetting
On three text‑classification datasets, Sleep outperforms traditional regularization (EWC), external learners (InCA), and pure in‑context learning (ICL), converting temporary prompt‑level adaptation into lasting parameterized memory.
4.2 Long‑Context Understanding (128K → 10M tokens)
In the BABILong benchmark, Sleep maintains stable performance even as context length grows to 10 million tokens, because explicit memory consolidation turns short‑lived activations into compact parameter representations.
4.3 Mathematical Reasoning – Beyond SFT and GRPO
On the AIME-24/25 and HMMT-25 benchmarks, Sleep achieves higher scores than supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO).
4.4 Knowledge Integration – Persistent New Facts
In the SQuAD knowledge-integration task, ablating the Dreaming stage drops performance to about 35%, confirming Dreaming’s critical role.
4.5 Few‑Shot Abstract Reasoning (ARC)
Sleep reaches 80% success on the ARC benchmark, surpassing ICL (0%), test-time training (TTT, 10%), and SEAL (72.5%).
4.6 Efficiency Analysis – Sleep Is Not Slow
Although supervised fine-tuning (SFT) runs 4× faster per step, it needs 3.6–4.8× more wall-clock time to match Sleep's performance on AIME-24/25 and HMMT-25, so Sleep is the more efficient option when measured by time to reach a target score.
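The two figures above imply how many more training steps SFT needs. The short calculation below works this out under two assumptions that the article does not state explicitly: that "4× faster per step" means an SFT step takes one quarter of the time of a Sleep step, and that the wall-clock ratios compare total time to reach the same score. The function name is illustrative.

```python
def implied_step_ratio(per_step_speedup, wall_clock_ratio):
    """Steps SFT needs relative to Sleep to reach the same score.

    If an SFT step costs 1/per_step_speedup of a Sleep step, and SFT's total
    time is wall_clock_ratio times Sleep's, then the step counts differ by
    the product of the two.
    """
    return per_step_speedup * wall_clock_ratio

for ratio in (3.6, 4.8):
    steps = implied_step_ratio(4, ratio)
    print(f"wall-clock {ratio}x slower -> about {steps:.1f}x as many steps")
```

Under these assumptions, SFT would need roughly 14 to 19 times as many steps as Sleep, which is why a faster step does not translate into faster training overall.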
Conclusion – Sleep as Essential LLM Infrastructure
Google’s paper frames Sleep not as an optional add-on but as a necessary foundation for LLMs to evolve from static tools into continual learners. As the authors put it, “For a continual learner, there is no clear boundary between training and testing; the model only experiences two states: active/awake when receiving input, and isolated learning/sleep when consolidating.”
When LLMs learn to sleep and dream, they truly become alive.
https://arxiv.org/pdf/2606.03979
Language Models Need Sleep: Learning to Self-Modify and Consolidate Memories
