Can Attention Replace Fixed Residuals? Inside the ‘Attention Residuals’ Breakthrough

This article analyzes the newly released Attention Residuals paper, explaining how learnable attention weighting replaces fixed residual addition to mitigate information dilution in deep LLMs, detailing the proposed Block AttnRes design, engineering trade‑offs, experimental results, and its significance for foundational model architecture.

PaperAgent

01 The Blind Spot of Residual Connections

Modern large language models rely on residual connections, introduced by He et al. in 2015, to let gradients flow unimpeded through deep networks. But the mechanism is rigid: each layer’s output is simply added to the running hidden state, so the state at any depth is an equal‑weight sum of all previous layers’ contributions. This creates two problems. First, the magnitude of the hidden state grows with depth, diluting early‑layer information. Second, the network cannot selectively revisit earlier representations; experiments showing that many layers can be pruned with little performance loss suggest that depth is used inefficiently.
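To see the dilution concretely, unroll the standard residual recurrence h_l = h_{l-1} + f_l(h_{l-1}) across L layers:

$$h_L = h_0 + \sum_{l=1}^{L} f_l(h_{l-1})$$

Every term, including the embedding h_0, enters with the same fixed coefficient of 1, so the relative share of any single early layer shrinks roughly as 1/L while the total magnitude keeps growing.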

The issue mirrors how RNNs compress information along the time dimension, a problem Transformers solved with token‑level attention. That raises a natural question: can attention also address depth‑wise information compression?

02 Kimi’s Answer: Turning Depth into Attention

Kimi’s report introduces Attention Residuals (AttnRes), replacing the fixed addition with a learnable, input‑dependent attention weighting between layers. The layer‑to‑layer attention weights follow the same softmax formulation as Transformer token attention, but the query vector is a separate learnable parameter rather than the previous layer’s output.
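The paper’s exact parameterization isn’t reproduced here, but a minimal PyTorch sketch of the idea might look like the following, assuming a per‑layer learnable query, a shared key projection over stored layer outputs, and a softmax over the depth dimension (all names and shapes are illustrative assumptions, not the released code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnRes(nn.Module):
    """Depth-wise attention residual (illustrative sketch, not the
    paper's released implementation). A learnable query scores the
    stored outputs of all earlier layers; their softmax-weighted sum
    replaces the plain residual sum as the input to the next layer."""

    def __init__(self, d_model: int):
        super().__init__()
        # Learnable query, independent of the previous layer's output.
        self.query = nn.Parameter(torch.randn(d_model) / d_model ** 0.5)
        self.key_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, past_outputs: list[torch.Tensor]) -> torch.Tensor:
        # past_outputs: one [batch, seq, d_model] tensor per earlier
        # layer (the embedding counts as layer 0).
        stacked = torch.stack(past_outputs)                    # [L, B, S, D]
        keys = self.key_proj(stacked)                          # [L, B, S, D]
        scores = torch.einsum("d,lbsd->lbs", self.query, keys)
        scores = scores / stacked.shape[-1] ** 0.5             # scaled dot product
        weights = F.softmax(scores, dim=0)                     # softmax over depth
        return torch.einsum("lbs,lbsd->bsd", weights, stacked)
```

Because the query is a standalone parameter, each layer learns its own routing pattern, while the keys keep the weighting input‑dependent without feeding the previous hidden state back into the score.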

This mechanism lets each layer actively choose which previous layers are most informative. Visualizations of the learned attention matrix reveal that:

Each layer still primarily attends to its immediate predecessor, preserving local information flow.

Some layers skip intermediate steps to attend to much earlier layers.

The embedding layer retains a consistently high weight, acting as a foundational signal.

Attention layers distribute focus more broadly, while MLP layers rely more heavily on the nearest previous layer.

Thus, the network learns to route information across depth rather than merely adding it.

03 Engineering the Idea: Block AttnRes

Applying full attention across all layers would incur O(L²) memory and communication costs, which become prohibitive during distributed training and inference. Kimi mitigates this by grouping layers into blocks (e.g., six layers per block). Within each block, standard residual addition collapses the block’s outputs into a single vector; attention is then performed across block representations only. This reduces the stored state from O(L) to O(B), where B is the number of blocks, and experiments show that around eight blocks capture most of the performance gain.
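Continuing the sketch above, the block structure might compose with the depth‑wise attention roughly as follows (the grouping and caching details are assumptions based on the description):

```python
class BlockAttnRes(nn.Module):
    """Block AttnRes sketch: layers are grouped into fixed-size
    blocks; ordinary residual addition runs inside a block, and the
    depth-wise attention (AttnRes above) is applied only across the
    O(B) block summaries instead of all O(L) layer outputs."""

    def __init__(self, layers: nn.ModuleList, block_size: int, d_model: int):
        super().__init__()
        self.layers = layers
        self.block_size = block_size
        self.attnres = AttnRes(d_model)  # reuses the sketch above

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        block_states = [x]  # the embedding is kept as a block-level entry
        for start in range(0, len(self.layers), self.block_size):
            # Attend over the stored block summaries only.
            h = self.attnres(block_states)
            # Standard residual addition collapses the block to one vector.
            for layer in self.layers[start:start + self.block_size]:
                h = h + layer(h)
            block_states.append(h)
        return block_states[-1]
```

With, say, six layers per block, the stored state drops sixfold relative to full AttnRes, while the within‑block addition preserves the cheap local residual path.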

Additional engineering tricks include:

During training, a cross‑stage cache in pipeline parallelism prevents redundant data movement between virtual stages.

During inference, a two‑stage process first computes block‑level attention, then incrementally integrates intra‑block information using an online softmax (sketched after this list), adding less than 2% latency.

The overall training overhead is negligible.
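The incremental integration step relies on the standard online (streaming) softmax recurrence. A plain‑Python sketch of one update, with the state layout and naming as assumptions rather than Kimi’s kernel implementation:

```python
import math

def online_softmax_update(state, score, value):
    """Fold one more (score, value) pair into a running softmax-weighted
    average without revisiting earlier terms. `state` is
    (running_max, denominator, weighted_sum); the final output is
    weighted_sum / denominator. This is the generic streaming-softmax
    recurrence, not Kimi's kernel code."""
    m, d, acc = state
    m_new = max(m, score)
    scale_old = math.exp(m - m_new)   # rescale all previous terms
    w = math.exp(score - m_new)       # weight of the new term
    d = d * scale_old + w
    acc = [a * scale_old + w * v for a, v in zip(acc, value)]
    return (m_new, d, acc)
```

Starting from the empty state (-math.inf, 0.0, [0.0] * dim), folding in the block‑level result first and the intra‑block contributions second reproduces the exact softmax over all of them, which is what lets the two‑stage inference stay numerically exact while adding little latency.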

04 Empirical Results

Kimi conducted scaling‑law experiments on five model sizes with three variants: baseline, Block AttnRes, and Full AttnRes, keeping hyper‑parameters fixed to favor the baseline. The AttnRes curves consistently achieve lower validation loss, and the Block variant provides roughly a 1.25× compute advantage at the 5.6 PFLOP/s‑day operating point.

Integrating Full AttnRes into the Kimi Linear architecture (48B total parameters, 3B activated) and training on 1.4T tokens yields notable downstream improvements on benchmarks such as MMLU and GPQA, especially for reasoning and code tasks. Ablation studies confirm the necessity of each design component, and architecture‑search heatmaps demonstrate that AttnRes makes more effective use of depth.

05 Final Thoughts

While many recent LLM advances focus on higher‑level components (more efficient attention kernels, smarter MoE routing, better training recipes), Kimi’s work revisits the residual connection itself, a foundational design introduced a decade ago. Alongside DeepSeek’s contemporaneous mHC paper, which makes residual weights learnable, both Chinese teams demonstrate that the long‑standing "default" residual design can still be fundamentally rethought.

Title: Attention Residuals
Paper: https://github.com/MoonshotAI/Attention-Residuals/blob/master/Attention_Residuals.pdf
DeepSeek mHC: https://arxiv.org/pdf/2512.24880
ResNet: https://arxiv.org/abs/1512.03385
Tags: deep learning, LLM, Attention, model architecture, Residual Connections, Block Attention