Weekly AI Paper Digest: New Transformer Advances in Sparsity, Memory, and Reasoning
This article reviews five recent Transformer papers—including Engram's conditional memory, STEM's embedding‑based scaling, SeedFold's biomolecular structure prediction, a critique of Transformers for time‑series forecasting, and reasoning models as societies of thought—highlighting their methods, datasets, and performance gains.
Since the 2017 "Attention Is All You Need" paper, Transformers have reshaped AI research, becoming a universal paradigm across NLP, vision, speech, multimodal learning, and scientific computing. Industry leaders (Google, OpenAI, Meta, Microsoft) push scale and engineering, while academia (Stanford, MIT, Berkeley) drives theory, structural improvements, and new paradigms.
1. Engram: Conditional Memory via Scalable Lookup – Researchers from Peking University and DeepSeek‑AI introduce Engram, a conditional memory module with O(1) lookup. By offloading static knowledge retrieval from the early Transformer layers into this lookup module, which complements MoE, the early layers are freed up for deeper reasoning. Engram improves BBH (+5.0), ARC‑Challenge (+3.7), HumanEval (+3.0), MATH (+2.4), and long‑context Multi‑Query NIAH (84.2 → 97.0) while keeping parameters and FLOPs constant.
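The digest doesn't reproduce Engram's exact module, but the core idea of an O(1), token‑indexed memory lookup that injects static knowledge into the hidden stream can be sketched as follows. The class name `ConditionalMemory`, the sigmoid gate, and the additive residual wiring are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn

class ConditionalMemory(nn.Module):
    """Toy sketch of an O(1) token-indexed memory lookup (hypothetical names,
    not Engram's actual implementation)."""
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        # One memory slot per token id; lookup cost is independent of table size.
        self.memory = nn.Embedding(vocab_size, d_model)
        # A learned gate decides how much retrieved knowledge enters the stream.
        self.gate = nn.Linear(d_model, 1)

    def forward(self, token_ids: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        retrieved = self.memory(token_ids)        # (batch, seq, d_model)
        g = torch.sigmoid(self.gate(hidden))      # (batch, seq, 1)
        # Static knowledge is injected additively, so early layers no longer
        # have to spend capacity on rote retrieval.
        return hidden + g * retrieved

mem = ConditionalMemory(vocab_size=32000, d_model=512)
ids = torch.randint(0, 32000, (2, 16))           # token ids
h = torch.randn(2, 16, 512)                      # hidden states from an early layer
out = mem(ids, h)                                # same shape as h
```

The appeal of the lookup is that retrieval cost stays constant no matter how large the memory table grows, which is what lets capacity scale without adding FLOPs.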
2. STEM: Scaling Transformers with Embedding Modules – Carnegie Mellon University and Meta AI propose STEM, a sparse architecture with static, token‑indexed routing. Embedding lookups replace the FFN up‑projection, cutting per‑token FLOPs and parameter access by roughly one third and enabling asynchronous CPU offloading. The design decouples capacity from compute and communication, allowing larger knowledge storage and editable knowledge injection. Compared with dense baselines, STEM yields roughly 3–4% gains on knowledge and reasoning benchmarks.
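As a rough illustration of the STEM idea, the sketch below replaces the FFN up‑projection matmul with a token‑indexed embedding lookup, so per‑token compute is dominated by the down‑projection while capacity lives in a static table that could be offloaded to CPU. The class `TokenIndexedFFN` and its wiring (GELU, a single down‑projection, a residual add) are assumptions for clarity, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TokenIndexedFFN(nn.Module):
    """Sketch of a token-indexed sparse FFN: the up-projection is a lookup
    keyed on the token id (hypothetical wiring, not STEM's exact design)."""
    def __init__(self, vocab_size: int, d_model: int, d_hidden: int):
        super().__init__()
        # Static per-token "up-projected" activations; the table can sit on CPU
        # and be prefetched asynchronously because indices are known in advance.
        self.up_table = nn.Embedding(vocab_size, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)
        self.act = nn.GELU()

    def forward(self, token_ids: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        up = self.up_table(token_ids)             # lookup instead of W_up @ hidden
        return hidden + self.down(self.act(up))   # only the down-projection is a matmul

ffn = TokenIndexedFFN(vocab_size=32000, d_model=512, d_hidden=2048)
ids = torch.randint(0, 32000, (2, 16))
h = torch.randn(2, 16, 512)
out = ffn(ids, h)                                 # (2, 16, 512)
```

Because the table is indexed rather than multiplied, adding entries grows capacity without adding per‑token FLOPs, and individual rows can be edited to inject or update knowledge.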
3. SeedFold: Scaling Biomolecular Structure Prediction – ByteDance's Seed team presents SeedFold, a scalable model that widens the Pairformer backbone and adopts linear triangular attention to cut computational cost. Trained on 26.5 M samples (0.18 M experimental structures plus data distilled from AFDB and MGnify), SeedFold achieves state‑of‑the‑art results on FoldBench and surpasses AlphaFold 3 on protein‑related tasks.
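The digest doesn't detail SeedFold's linear triangular attention, but the generic linear‑attention trick below shows the kind of saving being targeted: a positive feature map lets the sum over keys factorize, so attending along one axis of the pair representation costs O(N) rather than O(N²) per row. This is a standard kernelized‑attention sketch, not SeedFold's actual formulation.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps: float = 1e-6):
    """Kernelized (linear) attention: softmax(QK^T)V is approximated with a
    positive feature map so cost is linear in the number of keys."""
    phi_q = F.elu(q) + 1.0                               # (..., N, d)
    phi_k = F.elu(k) + 1.0                               # (..., N, d)
    kv = torch.einsum('...nd,...ne->...de', phi_k, v)    # one pass over keys
    norm = torch.einsum('...nd,...d->...n', phi_q, phi_k.sum(dim=-2)) + eps
    return torch.einsum('...nd,...de->...ne', phi_q, kv) / norm.unsqueeze(-1)

# Pair representation z[i, j, :] for N residues; attend along the last residue axis.
N, c = 64, 32
pair = torch.randn(N, N, c)
out = linear_attention(pair, pair, pair)                 # (N, N, c), linear in N per row
```

Swapping cubic‑scaling softmax triangle updates for a linearized form of this kind is what makes it feasible to widen the backbone without a matching blow‑up in compute.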
4. Are Transformers Effective for Time Series Forecasting? – An analysis by Google, University of Chicago, and Santa Fe Institute finds that self‑attention's permutation invariance discards crucial temporal ordering. Empirical comparisons show that a simple single‑layer linear model outperforms complex Transformers on multiple real‑world time‑series datasets, challenging current research directions and urging a reassessment of Transformers for temporal tasks.
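The linear baseline at issue is simple enough to write out: a single linear map from the look‑back window to the forecast horizon, applied per channel. The look‑back and horizon lengths below are illustrative, and the paper's exact variants may differ.

```python
import torch
import torch.nn as nn

class LinearForecaster(nn.Module):
    """Single-layer linear baseline: one linear map from history to horizon,
    shared across channels (hyperparameters here are illustrative)."""
    def __init__(self, lookback: int, horizon: int):
        super().__init__()
        self.proj = nn.Linear(lookback, horizon)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, lookback) -> (batch, channels, horizon)
        return self.proj(x)

model = LinearForecaster(lookback=336, horizon=96)
x = torch.randn(8, 7, 336)      # e.g., a 7-variable series with 336 steps of history
y_hat = model(x)                # (8, 7, 96) forecast
```

Because the weights are tied to specific positions in the input window, the model is not permutation‑invariant, which is exactly the property the analysis argues self‑attention lacks for temporal data.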
5. Reasoning Models Generate Societies of Thought – Researchers from Google, University of Chicago, and Santa Fe Institute argue that advanced reasoning models (e.g., DeepSeek‑R1, QwQ‑32B) succeed not merely because of longer reasoning chains but because they implicitly simulate a "society of thought": multiple internal personas engaging in dialogue. Through mechanistic interpretability and controlled RL experiments, they demonstrate causal links between conversational behaviors (questioning, conflict, reconciliation) and accuracy, and show that prompting "surprise" tokens can double performance, highlighting diversity and coordination as core to effective artificial reasoning.