Can Attention Replace Residuals? Inside the New Attention Residuals Breakthrough

This article reviews the Kimi team's Attention Residuals approach, which replaces ResNet-style additive shortcuts with learned, attention-based weighting over earlier layers. It explains the theoretical motivation linking depth to time, details the full-attention and block-wise implementations, and presents experimental results showing roughly a 1.25× compute-efficiency gain along with improved performance on reasoning and knowledge tasks.


Background and Motivation

Since the introduction of ResNet in 2015, the simple additive shortcut "input plus output" has dominated neural network design, but it suffers from two major drawbacks as models grow deeper: (1) Information dilution – fixed‑weight averaging causes shallow features to be linearly attenuated across layers; (2) Hidden‑state explosion – maintaining signal strength forces deeper modules to produce larger activations, destabilizing training and gradient flow.

The Kimi team observes that model depth can be interpreted as a form of time, suggesting that the same attention mechanisms that replace recurrence in Transformers could replace additive residuals.

Attention Residuals Concept

Attention Residuals (AttnRes) introduce a learnable, input‑dependent attention weight for each previous layer, allowing the network to selectively aggregate past representations instead of uniformly summing them.
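In symbols, one plausible reading of this design (the notation here is illustrative, not the paper's own): the standard pre-norm residual update $h_l = h_{l-1} + f_l(h_{l-1})$ becomes an attention-weighted mixture over all earlier layer outputs,

$$
h_l = \sum_{j \le l} \alpha_{l,j}\, o_j,
\qquad
\alpha_{l,j} = \frac{\exp\!\big(q_l^{\top} k_j / \sqrt{d}\big)}{\sum_{j' \le l} \exp\!\big(q_l^{\top} k_{j'} / \sqrt{d}\big)},
$$

where $o_j$ is the output of layer $j$, $k_j$ a key derived from it, and $q_l$ the query associated with layer $l$; the weights $\alpha_{l,j}$ are input-dependent and normalized over the depth dimension.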

Full‑Attention Residuals

Each layer emits a learnable Query that is matched against Key vectors from all earlier layers. A softmax over the depth dimension yields attention weights that can heavily favor a few relevant layers (e.g., layer 50 may assign 0.8 weight to layer 2). This replaces the fixed‑weight addition of standard residuals.
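To make the mechanism concrete, below is a minimal PyTorch sketch reconstructed from the description above; it is an assumption, not the Kimi team's implementation. The class name, the per-layer learnable query parameter, and the shared key projection are all illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FullAttnResidual(nn.Module):
    """Illustrative full-attention residual: mixes all earlier layer outputs."""

    def __init__(self, d_model: int, n_layers: int, d_key: int = 64):
        super().__init__()
        # Assumed design: one learnable query vector per layer.
        self.queries = nn.Parameter(torch.randn(n_layers, d_key) * 0.02)
        # Assumed design: a shared projection that turns layer outputs into keys.
        self.key_proj = nn.Linear(d_model, d_key, bias=False)
        self.scale = d_key ** -0.5

    def forward(self, layer_idx: int, history: torch.Tensor) -> torch.Tensor:
        # history: [depth, batch, seq, d_model], the outputs of layers 0..layer_idx.
        keys = self.key_proj(history)                 # [depth, batch, seq, d_key]
        q = self.queries[layer_idx]                   # [d_key]
        scores = (keys * q).sum(dim=-1) * self.scale  # [depth, batch, seq]
        weights = F.softmax(scores, dim=0)            # softmax over the depth dimension
        # Input-dependent mixture of earlier representations replaces the plain sum.
        return (weights.unsqueeze(-1) * history).sum(dim=0)
```

In a full model, this mixture would serve as the residual state fed to the next layer, so a heavily weighted early layer (the "layer 50 attends to layer 2" case) can dominate the aggregate without its signal being diluted by every intermediate addition.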

Block‑wise Attention Residuals

Because full‑depth attention incurs O(L²) cost, the authors propose a block strategy:

Intra‑Block: The network is divided into N blocks; within a block, layer outputs are summed into a single block representation.

Inter‑Block: Residual aggregation attends to whole‑block representations rather than individual layers. The value matrix for a layer in block n is defined over block‑level keys, reducing memory from O(L) to O(N).

This design keeps computational complexity manageable while preserving most of the performance gains.
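The following is a functional sketch of the block strategy under the same assumed query/key parameterization as the previous snippet; it is again an illustration rather than the authors' code, and layer_outputs, query, key_proj, and block_size are hypothetical arguments.

```python
import torch
import torch.nn.functional as F

def block_attn_residual(layer_outputs, query, key_proj, block_size):
    """Illustrative block-wise residual: sum within blocks, attend across blocks."""
    # Intra-block: collapse each group of `block_size` layer outputs into one state.
    blocks = [torch.stack(layer_outputs[i:i + block_size]).sum(dim=0)
              for i in range(0, len(layer_outputs), block_size)]
    mem = torch.stack(blocks)                               # [n_blocks, batch, seq, d_model]
    # Inter-block: keys are computed from block representations, not single layers,
    # so only O(N) block states need to be retained rather than O(L) layer states.
    keys = key_proj(mem)                                    # [n_blocks, batch, seq, d_key]
    scores = (keys * query).sum(dim=-1) * keys.shape[-1] ** -0.5
    weights = F.softmax(scores, dim=0)                      # softmax over the block dimension
    return (weights.unsqueeze(-1) * mem).sum(dim=0)
```

With roughly eight blocks, as in the experiments below, the history carried alongside the forward pass stays small regardless of total depth.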

Experimental Evaluation

The authors evaluate the method on a MoE‑based Transformer (Kimi Linear) that follows the Moonlight/DeepSeek‑V3 architecture. They train five model scales, each with three variants: a PreNorm baseline, Full AttnRes, and Block AttnRes (≈8 blocks). All other components (depth, hidden size, expert routing, MLP) remain unchanged.

Key findings:

Across all scales, AttnRes consistently achieves lower validation loss, yielding about a 1.25× compute advantage (e.g., 1.692 vs. 1.714 loss at 5.6 PFLOP/s‑days).

Block AttnRes reduces memory from O(L) to O(N) with less than 2 % inference latency increase.

Gradient magnitude becomes more uniform across depth, mitigating early‑layer gradient spikes seen in the baseline.

Downstream tasks show notable gains: GPQA‑Diamond (+7.5), Minerva Math (+3.6), MMLU (+1.1), and TriviaQA (+1.9), along with improvements on code generation.

Overall compute consumption drops by roughly 20 % for comparable performance.

Figures illustrate the architecture overview, scaling curves, and training dynamics.

Conclusions

Attention Residuals reinterpret depth as a temporal dimension and replace naïve additive shortcuts with learned attention, achieving better efficiency, stability, and task performance. The block‑wise variant makes the approach practical for very deep models, suggesting a new direction for future architecture and even optimizer design.

Figure 1: Overview of Attention Residuals
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

deep learning · Transformer · attention mechanism · Model Efficiency · Residual Networks
Written by

Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
