Machine Heart
Apr 3, 2026 · Artificial Intelligence

Kimi’s ‘Option Time Machine’: Interns Gain Equity While Building Cutting‑Edge AI

Kimi, a three‑year‑old AI‑native unicorn valued at over $120 billion, launches a “Time‑Machine” option program that grants interns equity; the article sets the move against the company’s rapid valuation growth, record‑breaking context lengths, novel Kimi Linear architecture, token‑efficiency gains, and open‑source models that rival leading LLMs.

AI Talent Program · Agent Swarms · Attention Residuals
SuanNi
Mar 17, 2026 · Artificial Intelligence

How Attention Residuals Boost Transformer Efficiency and Scale

The article presents the Attention Residuals architecture, explains how it replaces uniform residual addition with learned attention‑based aggregation, details the full and block variants along with engineering tricks for distributed training, and reports extensive scaling‑law experiments in which the new design consistently improves validation loss and training efficiency across model sizes. (A minimal code sketch of the aggregation idea follows this entry.)

Attention Residuals · Transformer · deep learning
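The aggregation idea in the summary above is concrete enough to sketch. Below is a minimal, illustrative PyTorch rendering, assuming per‑layer learned logits softmaxed over the stack of earlier hidden states; the class name AttentionResidualBlock, the logit parameterization, and the MLP stand‑in sublayer are assumptions for exposition, not the Kimi team’s actual design.

import torch
import torch.nn as nn

class AttentionResidualBlock(nn.Module):
    """One layer whose residual input is a learned softmax-weighted mix
    of all earlier hidden states, replacing the uniform x + f(x) update.
    Illustrative sketch only, not the published formulation."""

    def __init__(self, dim: int, layer_index: int):
        super().__init__()
        # Stand-in sublayer; a real block would hold attention + MLP.
        self.f = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )
        # One learned logit per entry in the layer history (the embeddings
        # plus each earlier block's output): layer_index + 1 entries.
        self.mix_logits = nn.Parameter(torch.zeros(layer_index + 1))

    def forward(self, history: list[torch.Tensor]) -> torch.Tensor:
        # history[i]: hidden state after stage i, shape (batch, seq, dim).
        weights = torch.softmax(self.mix_logits, dim=0)
        # Convex combination of the history: selective, not uniform.
        mixed = sum(w * h for w, h in zip(weights, history))
        return mixed + self.f(mixed)

# Usage: each block reads the full history and appends its own output.
dim, n_layers = 64, 4
blocks = nn.ModuleList(AttentionResidualBlock(dim, i) for i in range(n_layers))
x = torch.randn(2, 16, dim)  # (batch, seq, dim) token embeddings
history = [x]
for block in blocks:
    history.append(block(history))
out = history[-1]

One plausible reading of the “block” variant the article mentions is that it restricts the attended history to a window of recent layers rather than the full stack, trading flexibility for memory; the article should be consulted for the actual definition.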
ShiZhen AI
Mar 17, 2026 · Artificial Intelligence

Kimi’s Attention Residuals Swap a Decade-Old Residual Trick for 1.25× Faster 48B MoE

The Kimi team introduces Attention Residuals, a softmax‑based replacement for the uniform residual connections Transformers have used for a decade. The design selectively aggregates each layer’s history, curbs hidden‑state growth, and achieves a 1.25× compute‑efficiency gain on a 48‑billion‑parameter MoE model with less than a 2% increase in inference latency. (The equations after this entry spell out the contrast with the standard residual stream.)

Attention Residuals · Compute Efficiency · MoE
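To make the contrast with the decade‑old residual trick concrete, here is one equation‑level reading of the summary, written in LaTeX; the exact parameterization in the Kimi paper may differ.

% Standard residual stream: unrolling the recursion shows every
% layer's output enters the final state with a fixed weight of 1,
% which is the "uniform" aggregation the teaser refers to.
h_{l+1} = h_l + F_l(h_l)
\quad\Longrightarrow\quad
h_L = h_0 + \sum_{l=0}^{L-1} F_l(h_l)

% Softmax-based reading of Attention Residuals (illustrative):
% layer l first forms a convex combination of its history, so the
% hidden state stays a bounded weighted average rather than an
% ever-growing sum, matching the reduced hidden-state growth claim.
\alpha_l = \operatorname{softmax}(a_l), \qquad
\tilde{h}_l = \sum_{i=0}^{l} \alpha_{l,i}\, h_i, \qquad
h_{l+1} = \tilde{h}_l + F_l(\tilde{h}_l)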