Kimi’s Attention Residuals Swap a Decade-Old Residual Trick for 1.25× Faster 48B MoE

The Kimi team introduces Attention Residuals, a softmax‑based replacement for the uniform residual connections used in Transformers for a decade, enabling selective aggregation of layer histories, reducing hidden‑state growth, and achieving a 1.25× compute‑efficiency gain on a 48‑billion‑parameter MoE model with less than 2% inference latency increase.


Residual Connections: A Democratic but Flawed Design

Transformer layers add their output back to the next layer’s input via residual connections, a practice inherited from ResNet in 2015 and used unchanged in almost all large models. The design gives every layer equal weight, so the first layer’s contribution is identical to that of the 47th layer.

This equal weighting creates two engineering headaches: PreNorm dilution, where early‑layer information is increasingly drowned out as depth grows, and unbounded hidden‑state growth, where the cumulative sum of layer outputs expands, destabilising training.
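For orientation, here is a minimal sketch of the standard pre‑norm residual stream being described; the depth, width, and the MLP stand‑in for a full attention/FFN block are illustrative assumptions, not Kimi's architecture:

import torch
import torch.nn as nn

class PreNormResidualStack(nn.Module):
    """Standard pre-norm stack: every layer output is added back to the
    stream with the same implicit weight of 1."""
    def __init__(self, depth=4, d=64):
        super().__init__()
        self.norms = nn.ModuleList(nn.LayerNorm(d) for _ in range(depth))
        # Simple MLP stand-in for a full attention/FFN Transformer block.
        self.layers = nn.ModuleList(
            nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
            for _ in range(depth)
        )

    def forward(self, x):
        h = x
        for norm, layer in zip(self.norms, self.layers):
            # h_l = h_{l-1} + f_l(norm(h_{l-1})): the stream is a uniform
            # cumulative sum of the embedding and every layer's output.
            h = h + layer(norm(h))
        return h

x = torch.randn(2, 16, 64)   # [batch, tokens, hidden]
print(PreNormResidualStack()(x).shape)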

[Figure: Committee analogy: equal weight vs selective choice]

AttnRes: Using Attention Over a Layer's Own History

The proposed Attention Residuals (AttnRes) apply the same selective attention mechanism that Transformers use across tokens, but in the depth dimension: each layer learns a "pseudo‑query" vector that produces softmax weights \(\alpha_{i\to l}\) for aggregating previous layer outputs.

Standard residuals compute the cumulative sum \(h_l = \sum_{i=0}^{l-1} h_i\). AttnRes replaces the uniform weight of 1 with dynamic softmax attention:

h_l = \sum_{i=0}^{l-1} \alpha_{i \to l} \cdot h_i

This makes the aggregation input‑dependent, limits output magnitude, yields a more uniform gradient distribution, and prevents early‑layer information from being lost.
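A minimal sketch of this depth‑wise attention with an explicit learned pseudo‑query; the projection shape, normalisation, and initialisation here are assumptions for illustration, not the Kimi implementation:

import torch
import torch.nn as nn

def attn_res(history, query, norm):
    """Aggregate a layer's own history with softmax weights alpha_{i->l}.

    history: previous hidden states h_0..h_{l-1}, each [B, T, D]
    query:   learned pseudo-query for the current layer, shape [D]
    """
    H = torch.stack(history)                    # [l, B, T, D]
    scores = torch.einsum('d,lbtd->lbt', query, norm(H))
    alpha = scores.softmax(dim=0)               # weights over depth, sum to 1
    return torch.einsum('lbt,lbtd->btd', alpha, H)

B, T, D = 2, 16, 64
history = [torch.randn(B, T, D) for _ in range(5)]   # outputs of 5 earlier layers
query = nn.Parameter(torch.randn(D) / D ** 0.5)       # per-layer pseudo-query
h_in = attn_res(history, query, nn.LayerNorm(D))
print(h_in.shape)  # torch.Size([2, 16, 64])

Because the weights are a softmax over depth, they sum to 1, which is what keeps the magnitude of the aggregated stream bounded compared with the uniform cumulative sum.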

[Figure: AttnRes principle comparison]

Engineering Challenge: Memory Consumption

Having every layer attend to all previous layers would require O(Ld) memory (L = number of layers, d = hidden dimension), which is infeasible for a 48‑billion‑parameter model.

Kimi solves this with Block AttnRes: the model is divided into N≈8 blocks. Within a block, standard residual accumulation is kept; across blocks, attention operates only on block‑level summary representations, not on every layer’s output.

This reduces memory from O(Ld) to O(Nd). The extra memory cost of eight blocks is negligible, and experiments show that Block AttnRes recovers most of the performance gain of full‑layer AttnRes.

[Figure: Block AttnRes block diagram]

Implementation

import torch

def block_attn_res(blocks, partial_block, proj, norm):
    # Stack the finished block summaries plus the current partial block.
    V = torch.stack(blocks + [partial_block])          # [N+1, B, T, D]
    K = norm(V)                                        # normalise keys before scoring
    # Learned pseudo-query (proj: Linear(D, 1)) scores each block per token.
    logits = torch.einsum('d,nbtd->nbt', proj.weight.squeeze(), K)
    # Softmax over the block dimension, then take the weighted sum of summaries.
    h = torch.einsum('nbt,nbtd->btd', logits.softmax(0), V)
    return h

The overall architecture stays unchanged: two einsum operations implement the block‑level attention, and inference latency increases by less than 2%.
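For concreteness, here is a sketch of how block_attn_res might be wired into a forward pass with eight blocks; the Linear(D, 1) pseudo‑query projection, the LayerNorm, and the single‑Linear stand‑ins for real Transformer layers are assumptions, and the exact placement of the block attention in Kimi Linear may differ:

import torch
import torch.nn as nn

# Hypothetical sizes; block_attn_res is the helper defined above.
B, T, D, N, layers_per_block = 2, 16, 64, 8, 6
proj = nn.Linear(D, 1, bias=False)   # learned pseudo-query for block-level attention
norm = nn.LayerNorm(D)
layers = nn.ModuleList(nn.Linear(D, D) for _ in range(N * layers_per_block))  # stand-ins

h, blocks = torch.randn(B, T, D), []
for n in range(N):
    partial = h
    for layer in layers[n * layers_per_block:(n + 1) * layers_per_block]:
        partial = partial + layer(partial)        # standard residuals inside the block
    # Cross-block attention over completed block summaries plus the current block:
    # memory grows with the number of blocks N, not the number of layers L.
    h = block_attn_res(blocks, partial, proj, norm)
    blocks.append(partial)
print(h.shape)  # torch.Size([2, 16, 64])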

[Figure: Block AttnRes architecture with 8 groups]

Experimental Results: What Does 1.25× Mean?

Kimi Linear (48B parameters, 3B active MoE) was pretrained on 1.4 trillion tokens. Compared with standard residuals, Block AttnRes achieved:

1.25× compute efficiency: the same downstream performance required only 80% of the compute (1/1.25 = 0.8), equivalent to a 25% increase in effective compute on the same hardware.

Training dynamics showed controlled hidden‑state magnitude growth and more uniform gradient distribution, leading to more stable training and reduced hyper‑parameter tuning effort.

Scaling‑law experiments confirmed the improvement across multiple model sizes, indicating the effect is not limited to a single scale.

[Figure: Training dynamics comparison]

Broader Context

Historically, deep‑learning architecture has repeatedly replaced fixed rules with attention: self‑attention supplanted RNN sequencing, MoE gating replaced static layer routing, and AttnRes now replaces fixed residual accumulation. Some view this as the Transformer finally applying attention along every dimension.

Critics note that while the 1.25× gain is tangible, the scaling‑law validation used a relatively small model (48B with 3B active), and real‑world deployment may require further exploration of block count and size.

[Figure: Architecture evolution from RNN to Transformer to AttnRes]

Conclusion

AttnRes transforms the once‑static residual connection into a learnable component, delivering a 25% compute‑efficiency boost on a 48B MoE model with negligible latency impact. If reproduced at larger scales, this technique is likely to become a standard upgrade in future large‑model pipelines.

[Figure: AttnRes vs standard residual: selective weights vs equal sum]
Tags: deep learning, Transformer, MoE, Compute Efficiency, Residual Connection, Attention Residuals
Written by

ShiZhen AI

Tech blogger with over 10 years of experience at leading tech firms, AI efficiency and delivery expert focusing on AI productivity. Covers tech gadgets, AI-driven efficiency, and leisure— AI leisure community. 🛰 szzdzhp001
