FlashDepthAttention and Mixed Depth Attention: The Next Phase of Large Model Architecture

The article argues that after a decade of scaling large language models by widening, deepening, and adding data, the real bottleneck now lies in inter‑layer communication, and it presents FlashDepthAttention and MoDA as efficient retrieval‑based mechanisms that replace additive residual connections, improve depth utilization, and boost model performance.


Over the past decade, progress in large language models has followed a simple scaling recipe: increasing parameters, data, and context length reliably lowers loss. This scaling, however, has largely ignored communication between layers; residual connections still use the unchanged x + F(x) formulation, so signals from early layers are progressively diluted as depth grows.
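
For concreteness, here is a minimal PyTorch sketch of the standard pre-norm residual block that the x + F(x) formulation refers to; the module and its names are illustrative, not taken from any particular model.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Standard pre-norm residual block: the skip path is a fixed, data-independent sum."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x + F(x): every layer adds its update onto the same stream,
        # so early-layer signals are diluted as depth grows.
        return x + self.ffn(self.norm(x))
```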

The "first half" of model scaling focused on extending the sequence dimension with sparse, linear, and hybrid attention (e.g., FlashAttention) and improved positional encodings (RoPE scaling), enabling models such as OpenAI‑O1 and DeepSeek‑R1 to handle 128K‑plus tokens.

When depth is increased—models with 32, 64, or even 100+ layers—the residual pathway remains unchanged, causing later layers to struggle to retrieve useful signals from earlier ones. The article illustrates this with a telephone‑game analogy, showing how accumulated noise makes it hard for a deep layer to hear the original message.

Previous attempts to alleviate the depth bottleneck include DenseNet’s dense connections, Hyper‑Connections, MUDDFormer’s dynamic mixing, and various learned weighting schemes (DenseFormer, LIMe). While these methods improve information flow, they still treat inter‑layer communication as an additive process.

The authors propose reframing inter-layer communication as retrieval rather than accumulation: each layer's output is cast as queries, keys, and values, so a later layer can directly retrieve relevant information from any previous layer. Independent works such as Google's DCA, Huawei's MRLA, Hessian.AI's Dreamer, and Kimi's AttnRes have converged on the same idea, suggesting the direction is sound.
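
A minimal sketch of what "inter-layer communication as retrieval" could look like, assuming a single head and illustrative projection matrices; it conveys the idea only, not the exact formulation of any of the works above.

```python
import torch
import torch.nn.functional as F

def depth_attention(h_current, h_history, w_q, w_k, w_v):
    """
    Retrieval across depth: each token's current hidden state queries the
    hidden states produced for that same token by all earlier layers.

    h_current: (batch, seq, d_model)          output of the current layer
    h_history: (batch, seq, n_prev, d_model)  stacked outputs of earlier layers
    w_q, w_k, w_v: (d_model, d_head)          projection matrices (illustrative)
    """
    q = h_current @ w_q                      # (batch, seq, d_head)
    k = h_history @ w_k                      # (batch, seq, n_prev, d_head)
    v = h_history @ w_v                      # (batch, seq, n_prev, d_head)

    scores = torch.einsum("bsd,bspd->bsp", q, k) / q.shape[-1] ** 0.5
    weights = F.softmax(scores, dim=-1)      # distribution over earlier layers
    return torch.einsum("bsp,bspd->bsd", weights, v)
```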

Implementing depth‑wise attention naïvely in PyTorch proved prohibitively slow (≈45 s per forward‑backward pass). Flash Depth Attention overcomes this by reorganizing the data layout to match GPU memory patterns, achieving orders‑of‑magnitude speedups while preserving full expressive power.
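
The article does not spell out the FDA kernel, but the flavor of the layout change can be illustrated by contrasting a Python loop over per-layer score computations with a single batched contraction over a contiguous depth dimension. This is only a rough analogy for the kind of reorganization involved, not the actual kernel:

```python
import torch

def naive_depth_scores(q, layer_outputs):
    """Naive: a Python loop over previous layers launches many small kernels."""
    # q: (batch, seq, d); layer_outputs: list of (batch, seq, d) tensors
    return torch.stack([(q * h).sum(-1) for h in layer_outputs], dim=-1)

def fused_depth_scores(q, stacked_history):
    """Layout-friendly: the depth dimension is materialized contiguously,
    so all per-layer scores come from one batched contraction."""
    # stacked_history: (batch, seq, n_prev, d), contiguous in memory
    return torch.einsum("bsd,bspd->bsp", q, stacked_history)
```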

Flash Depth Attention (FDA) and its extension, Mixed Depth Attention (MoDA), combine depth and sequence retrieval in a single softmax: each attention head attends simultaneously to the current layer's sequence KV pairs and to the KV pairs of all preceding layers. The backbone pipeline changes from Res → Seq-Attn → Res → FFN to Depth-Attn → Seq-Attn → Depth-Attn → FFN, with shared queries but distinct keys/values for the sequence and depth axes.
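
A hedged sketch of a joint softmax over both retrieval axes, assuming shared queries, separate sequence and depth key/value projections, and omitting causal masking; the exact FDA/MoDA formulation may differ.

```python
import torch
import torch.nn.functional as F

def joint_depth_sequence_attention(q, k_seq, v_seq, k_depth, v_depth):
    """
    One softmax over both candidate sets: the query attends jointly to the
    current layer's sequence positions and to this token's states from all
    earlier layers. Causal masking is omitted for brevity.

    q:       (batch, heads, seq, d_head)           shared query
    k_seq:   (batch, heads, seq, d_head)           keys over sequence positions
    v_seq:   (batch, heads, seq, d_head)
    k_depth: (batch, heads, seq, n_prev, d_head)   keys over earlier layers, per token
    v_depth: (batch, heads, seq, n_prev, d_head)
    """
    scale = q.shape[-1] ** -0.5
    seq_scores = torch.einsum("bhqd,bhkd->bhqk", q, k_seq) * scale       # (b,h,seq,seq)
    depth_scores = torch.einsum("bhqd,bhqpd->bhqp", q, k_depth) * scale  # (b,h,seq,n_prev)

    # A single softmax over the concatenated sequence + depth candidates.
    weights = F.softmax(torch.cat([seq_scores, depth_scores], dim=-1), dim=-1)
    w_seq, w_depth = weights.split(
        [seq_scores.shape[-1], depth_scores.shape[-1]], dim=-1
    )

    out = torch.einsum("bhqk,bhkd->bhqd", w_seq, v_seq)
    out = out + torch.einsum("bhqp,bhqpd->bhqd", w_depth, v_depth)
    return out
```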

Figure: Flash Depth Attention architecture.

Empirical results on the OLMo2 baseline show that MoDA reduces the “attention sink” phenomenon—where probability mass concentrates on a few tokens—and encourages deeper layers to actively retrieve useful information, leading to consistent performance gains.

The authors argue that any neural component that currently passes information through static, data‑independent channels (between layers, modalities, or time steps) could benefit from a retrieval‑based redesign, opening a new research frontier beyond merely widening or deepening networks.

In summary, the "second half" of large-model architecture shifts focus from scaling components to scaling communication, and Flash Depth Attention, together with MoDA, provides a practical, fast, and effective solution.

Paper: https://arxiv.org/abs/2603.15619
Code: https://github.com/hustvl/MoDA
Lab: https://github.com/hustvl

Tags: large language models, MoDA, residual connections, depth attention, FlashDepthAttention, neural network architecture
Written by Machine Learning Algorithms & Natural Language Processing, focused on frontier AI technologies, empowering AI researchers' progress.
