FlashDepthAttention and Mixed Depth Attention: The Next Phase of Large Model Architecture

The article argues that after a decade of scaling large language models by widening, deepening, and adding data, the real bottleneck now lies in inter‑layer communication, and it presents FlashDepthAttention and MoDA as efficient retrieval‑based mechanisms that replace additive residual connections, improve depth utilization, and boost model performance.


Over the past decade, progress in large language models has followed a simple scaling recipe: increasing parameters, data, and context length reliably lowers loss. This scaling, however, has largely ignored communication between layers; residual connections still use the unchanged x + F(x) formulation, so signals from early layers are progressively diluted as depth grows.
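
For concreteness, here is a minimal PyTorch sketch of the standard pre-norm residual block that the x + F(x) formulation refers to; the module and its names are illustrative, not taken from any particular model.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Standard pre-norm residual block: the skip path is a fixed, data-independent sum."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x + F(x): every layer adds its update onto the same stream,
        # so early-layer signals are diluted as depth grows.
        return x + self.ffn(self.norm(x))
```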

The "first half" of model scaling focused on extending the sequence dimension with sparse, linear, and hybrid attention (e.g., FlashAttention) and improved positional encodings (RoPE scaling), enabling models such as OpenAI‑O1 and DeepSeek‑R1 to handle 128K‑plus tokens.

When depth is increased—models with 32, 64, or even 100+ layers—the residual pathway remains unchanged, causing later layers to struggle to retrieve useful signals from earlier ones. The article illustrates this with a telephone‑game analogy, showing how accumulated noise makes it hard for a deep layer to hear the original message.

Previous attempts to alleviate the depth bottleneck include DenseNet’s dense connections, Hyper‑Connections, MUDDFormer’s dynamic mixing, and various learned weighting schemes (DenseFormer, LIMe). While these methods improve information flow, they still treat inter‑layer communication as an additive process.

The authors propose reframing inter-layer communication as retrieval rather than accumulation: each layer's output is cast as queries, keys, and values, so a later layer can directly retrieve relevant information from any previous layer. Independent works such as Google's DCA, Huawei's MRLA, Hessian.AI's Dreamer, and Kimi's AttnRes have converged on the same idea, suggesting the direction is sound.
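
A minimal sketch of what "inter-layer communication as retrieval" could look like, assuming a single head and illustrative projection matrices; it conveys the idea only, not the exact formulation of any of the works above.

```python
import torch
import torch.nn.functional as F

def depth_attention(h_current, h_history, w_q, w_k, w_v):
    """
    Retrieval across depth: each token's current hidden state queries the
    hidden states produced for that same token by all earlier layers.

    h_current: (batch, seq, d_model)          output of the current layer
    h_history: (batch, seq, n_prev, d_model)  stacked outputs of earlier layers
    w_q, w_k, w_v: (d_model, d_head)          projection matrices (illustrative)
    """
    q = h_current @ w_q                      # (batch, seq, d_head)
    k = h_history @ w_k                      # (batch, seq, n_prev, d_head)
    v = h_history @ w_v                      # (batch, seq, n_prev, d_head)

    scores = torch.einsum("bsd,bspd->bsp", q, k) / q.shape[-1] ** 0.5
    weights = F.softmax(scores, dim=-1)      # distribution over earlier layers
    return torch.einsum("bsp,bspd->bsd", weights, v)
```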

Implementing depth‑wise attention naïvely in PyTorch proved prohibitively slow (≈45 s per forward‑backward pass). Flash Depth Attention overcomes this by reorganizing the data layout to match GPU memory patterns, achieving orders‑of‑magnitude speedups while preserving full expressive power.
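
The article does not spell out the FDA kernel, but the flavor of the layout change can be illustrated by contrasting a Python loop over per-layer score computations with a single batched contraction over a contiguous depth dimension. This is only a rough analogy for the kind of reorganization involved, not the actual kernel:

```python
import torch

def naive_depth_scores(q, layer_outputs):
    """Naive: a Python loop over previous layers launches many small kernels."""
    # q: (batch, seq, d); layer_outputs: list of (batch, seq, d) tensors
    return torch.stack([(q * h).sum(-1) for h in layer_outputs], dim=-1)

def fused_depth_scores(q, stacked_history):
    """Layout-friendly: the depth dimension is materialized contiguously,
    so all per-layer scores come from one batched contraction."""
    # stacked_history: (batch, seq, n_prev, d), contiguous in memory
    return torch.einsum("bsd,bspd->bsp", q, stacked_history)
```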

Flash Depth Attention (FDA) and its extension, Mixed Depth Attention (MoDA), combine depth and sequence retrieval in a single softmax: each attention head attends simultaneously to the current layer's sequence KV pairs and to the KV pairs of all preceding layers. The backbone pipeline changes from Res → Seq-Attn → Res → FFN to Depth-Attn → Seq-Attn → Depth-Attn → FFN, with shared queries but distinct keys/values for the sequence and depth axes.
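
A hedged sketch of a joint softmax over both retrieval axes, assuming shared queries, separate sequence and depth key/value projections, and omitting causal masking; the exact FDA/MoDA formulation may differ.

```python
import torch
import torch.nn.functional as F

def joint_depth_sequence_attention(q, k_seq, v_seq, k_depth, v_depth):
    """
    One softmax over both candidate sets: the query attends jointly to the
    current layer's sequence positions and to this token's states from all
    earlier layers. Causal masking is omitted for brevity.

    q:       (batch, heads, seq, d_head)           shared query
    k_seq:   (batch, heads, seq, d_head)           keys over sequence positions
    v_seq:   (batch, heads, seq, d_head)
    k_depth: (batch, heads, seq, n_prev, d_head)   keys over earlier layers, per token
    v_depth: (batch, heads, seq, n_prev, d_head)
    """
    scale = q.shape[-1] ** -0.5
    seq_scores = torch.einsum("bhqd,bhkd->bhqk", q, k_seq) * scale       # (b,h,seq,seq)
    depth_scores = torch.einsum("bhqd,bhqpd->bhqp", q, k_depth) * scale  # (b,h,seq,n_prev)

    # A single softmax over the concatenated sequence + depth candidates.
    weights = F.softmax(torch.cat([seq_scores, depth_scores], dim=-1), dim=-1)
    w_seq, w_depth = weights.split(
        [seq_scores.shape[-1], depth_scores.shape[-1]], dim=-1
    )

    out = torch.einsum("bhqk,bhkd->bhqd", w_seq, v_seq)
    out = out + torch.einsum("bhqp,bhqpd->bhqd", w_depth, v_depth)
    return out
```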

Figure: Flash Depth Attention architecture.

Empirical results on the OLMo2 baseline show that MoDA reduces the “attention sink” phenomenon—where probability mass concentrates on a few tokens—and encourages deeper layers to actively retrieve useful information, leading to consistent performance gains.

The authors argue that any neural component that currently passes information through static, data‑independent channels (between layers, modalities, or time steps) could benefit from a retrieval‑based redesign, opening a new research frontier beyond merely widening or deepening networks.

In summary, the "second half" of large-model architecture shifts focus from scaling components to scaling communication, and Flash Depth Attention, together with MoDA, provides a practical, fast, and effective solution.

Paper: https://arxiv.org/abs/2603.15619
Code: https://github.com/hustvl/MoDA
Lab: https://github.com/hustvl

Tags: large language models, MoDA, residual connections, depth attention, FlashDepthAttention, neural network architecture
Written by Machine Learning Algorithms & Natural Language Processing, focused on frontier AI technologies, empowering AI researchers' progress.
