FlashDepthAttention and Mixed Depth Attention: The Next Phase of Large Model Architecture
The article argues that after a decade of scaling large language models by making them wider, deeper, and more data-hungry, the real bottleneck now lies in inter-layer communication. It presents FlashDepthAttention and Mixed Depth Attention (MoDA) as efficient retrieval-based mechanisms that replace additive residual connections, improve depth utilization, and boost model performance.

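To make the core idea concrete, below is a minimal sketch, assuming PyTorch and a simplified single-head formulation: instead of the plain additive residual x + f(x), each layer attends over the outputs of all earlier layers along the depth axis and retrieves a weighted combination. The class name `DepthAttentionBlock` and every detail here are illustrative assumptions, not the article's actual FlashDepthAttention or MoDA implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthAttentionBlock(nn.Module):
    """Illustrative depth-attention block (assumed design, not the paper's).

    Rather than adding the previous activation back in (residual connection),
    the block forms a query from the current activation and retrieves from
    the outputs of all earlier layers.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor, history: list[torch.Tensor]) -> torch.Tensor:
        # history: outputs of all earlier layers, each (batch, seq, d_model)
        h = torch.stack(history + [x], dim=2)           # (batch, seq, depth, d_model)
        q = self.q(x).unsqueeze(2)                       # (batch, seq, 1, d_model)
        k, v = self.k(h), self.v(h)
        scores = (q * k).sum(-1) / h.shape[-1] ** 0.5    # attention over the depth axis
        weights = F.softmax(scores, dim=-1).unsqueeze(-1)
        retrieved = (weights * v).sum(dim=2)             # depth-weighted retrieval
        # retrieval takes the place of the additive residual connection
        return self.ffn(retrieved)
```

In a stack of such blocks, each layer's output would be appended to the history list so that later layers can retrieve from any depth rather than only from the layer directly below.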