Why Kimi Dropped Residual Connections: A First‑Person Deep Dive into Attention Residuals
This article explains how Attention Residuals (AttnRes) replace traditional residual shortcuts with layer‑wise attention, details the mathematical reformulation, design constraints, static‑Q trick, full and block variants, and presents experimental evidence of significant accuracy gains with modest overhead.
Motivation and Overview
Attention Residuals (AttnRes) are introduced as a principled way to rewrite the inter‑layer connections of deep neural networks using attention instead of the classic residual addition. The authors argue that residuals are essentially an equal‑weight sum of layer outputs, which can be generalized to a weighted‑sum formulation and then to a full attention mechanism.
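To fix notation before the derivation, here is a minimal sketch of the unrolled residual view the argument starts from; the symbols x_l, F_l and a_l are mine, not the paper's:

```latex
% Unrolled residual stack: the state reaching the top is an equal-weight
% sum of the input embedding and every sub-layer output.
x_L = x_0 + \sum_{l=1}^{L} F_l(x_{l-1})

% Generalisation: attach a weight a_l to each contribution. AttnRes goes
% one step further and lets these weights come from an attention mechanism.
x_L = a_0\, x_0 + \sum_{l=1}^{L} a_l\, F_l(x_{l-1})
```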
From Residuals to Hyper‑Connections (HC)
The article revisits the long‑standing Pre‑Norm/Post‑Norm debate, noting that many subsequent normalisation tricks are merely variations of residuals. It then describes HC, a design that expands the residual stream (k‑fold) but suffers from instability, and its improved version mHC used by DeepSeek, which adds constraints to keep contributions directionally consistent and non‑negative.
Deriving Attention Residuals
Starting from the standard residual form, the authors rewrite it as a weighted sum over all earlier layer outputs; once that weight matrix is read as an attention matrix, the residual path becomes an attention operation. Two constraints are imposed: (1) the contribution of each layer must have the same sign, so that updates do not conflict, and (2) RMSNorm is applied so that weighted averaging and weighted summation become mathematically equivalent, preserving expressive power.
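A hedged sketch of that reformulation, assuming softmax-normalised weights and RMSNorm on each layer output (the notation y_j, alpha and s is mine):

```latex
% Input to layer l+1 as an attention-weighted mixture of all earlier
% outputs y_0, ..., y_l (y_0 is the embedding, y_j the output of layer j).
x_{l+1} = \sum_{j=0}^{l} \alpha_{l,j}\, \mathrm{RMSNorm}(y_j),
\qquad
\alpha_{l,j} = \frac{\exp(s_{l,j})}{\sum_{j'=0}^{l} \exp(s_{l,j'})}

% The softmax keeps every alpha_{l,j} non-negative, so all contributions
% share the same sign (constraint 1). Because each y_j is RMS-normalised,
% rescaling a whole row of weights changes only the magnitude of x_{l+1},
% not its direction, which is why a weighted average stays as expressive
% as a weighted sum (constraint 2).
```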
First‑Version AttnRes
The initial design uses a static, data‑independent query vector (Q) while the key (K) and value (V) are derived from the layer outputs. This simple Softmax‑based attention already yields a noticeable improvement over vanilla residuals. Collaborative experiments with teammates (Zhang Yu, Guang Yu, etc.) confirmed the benefit on larger models.
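As a concrete illustration, here is a minimal PyTorch-style sketch of a static-Q mixer in the spirit of this first version; the class name, tensor shapes, and the choice of using the RMS-normed layer outputs as both K and V are my assumptions, not the paper's code:

```python
import torch
import torch.nn.functional as F


def rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Plain RMSNorm without a learned gain, kept inline for self-containment.
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)


class StaticQAttnRes(torch.nn.Module):
    """Sketch of a first-version AttnRes mixer with a static, data-independent Q."""

    def __init__(self, num_layers: int, dim: int):
        super().__init__()
        # One learned query vector per layer position; it never sees the data.
        self.q = torch.nn.Parameter(torch.randn(num_layers, dim) / dim**0.5)

    def forward(self, layer_idx: int, history: list[torch.Tensor]) -> torch.Tensor:
        # history: embedding output plus outputs of layers 0..layer_idx-1,
        # each of shape (batch, seq, dim). K and V are the normed outputs.
        keys = torch.stack([rms_norm(h) for h in history], dim=0)     # (L, B, S, D)
        scores = torch.einsum("d,lbsd->lbs", self.q[layer_idx], keys)
        weights = F.softmax(scores / keys.shape[-1] ** 0.5, dim=0)    # softmax over layers
        return torch.einsum("lbs,lbsd->bsd", weights, keys)           # mixed layer input
```

In this sketch the attention mixture simply replaces the residual add at each layer's input; the actual design additionally normalises the attention path, as discussed next.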
Stability Enhancements
Adding RMSNorm to the attention path stabilises both training and inference. Because Q is static, the attention that later layers pay to earlier ones can be pre-computed, freeing up inference bandwidth. However, on the current training infrastructure the fully dense AttnRes still incurs noticeable memory and communication costs.
Block AttnRes: Compression for Efficiency
To reduce this overhead, the authors propose a block-wise variant. The embedding layer is kept as its own block, because visualising its attention matrix reveals a distinctly strong pattern. The remaining layers are grouped into roughly eight blocks; each block first compresses its internal representations by summation and then attends across blocks. Experiments show that this design adds less than 5% extra cost while delivering roughly a 25% accuracy gain, making it practical for K-scale models.
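The following sketch shows one plausible reading of the block-wise scheme; the block sizes, the summation-based compression at this granularity, and the helper names are my assumptions:

```python
import torch
import torch.nn.functional as F


def rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)


def block_attn_res(history: list[torch.Tensor], block_size: int, query: torch.Tensor) -> torch.Tensor:
    """Sketch of Block AttnRes mixing.

    history    -- embedding output followed by per-layer outputs, each (batch, seq, dim)
    block_size -- number of consecutive layers compressed into one block
    query      -- static query vector of shape (dim,)
    """
    # The embedding keeps its own block; its attention pattern is distinct.
    blocks = [history[0]]
    rest = history[1:]
    for i in range(0, len(rest), block_size):
        # Compress a group of consecutive layers into one representation by summation.
        blocks.append(torch.stack(rest[i:i + block_size], dim=0).sum(dim=0))

    # Attention is then computed across the handful of blocks instead of all layers.
    keys = torch.stack([rms_norm(b) for b in blocks], dim=0)        # (num_blocks, B, S, D)
    scores = torch.einsum("d,nbsd->nbs", query, keys)
    weights = F.softmax(scores / keys.shape[-1] ** 0.5, dim=0)      # softmax over blocks
    return torch.einsum("nbs,nbsd->bsd", weights, keys)
```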
Matrix Perspective
All variants—Residuals, HC/mHC, Full AttnRes, Block AttnRes—can be expressed as special cases of an attention matrix. This unified view clarifies how each method relates to the others and highlights the flexibility of the attention formulation.
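To make the unified view concrete, here is a hedged sketch of the variants written as one lower-triangular mixing matrix acting on the stacked (normed) layer outputs; the 4x4 example and the reading of HC/mHC are mine, not the paper's exact formulation:

```latex
% Inter-layer mixing as a lower-triangular matrix A applied to the stacked
% layer outputs Y (row l gives the input fed to layer l+1):
X_{\text{in}} = A\,Y

% Plain residuals: every earlier output is added with weight 1.
A_{\text{res}} =
\begin{pmatrix}
1 &   &   &   \\
1 & 1 &   &   \\
1 & 1 & 1 &   \\
1 & 1 & 1 & 1
\end{pmatrix}

% Full AttnRes: A is learned and row-wise softmax-normalised, so each row is
% a non-negative distribution over earlier layers. Block AttnRes ties the
% columns belonging to one block to a shared weight, and HC/mHC can be read
% as a constrained, fixed-width variant of the same matrix.
```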
Related Work
The paper surveys dense‑connection and depth‑attention literature, citing DenseNet[11], DenseFormer[12], ANCRe[13], MUDDFormer[14], MRLA[15], Dreamer[16], ELMo[17], SKNets[18], LIMe[19], DCA[20] and others. It acknowledges that many prior works explored similar ideas, but claims AttnRes is the first to achieve a scalable, efficient replacement for residuals in very large models.
Conclusion
AttnRes demonstrates that layer‑wise attention can serve as a strong alternative to residual connections, satisfying both training and inference efficiency requirements and scaling to large‑scale NLP models.