
How Mixture-of-Depths Attention Boosts Large Language Model Efficiency

This article examines the Mixture-of-Depths Attention (MoDA) mechanism: its flash-compatible KV layout, its combined sequence-and-depth attention, its theoretical analysis, and experiments that show lower validation loss and higher downstream-task accuracy than the OLMo2 baseline.

Data Party THU

Mar 26, 2026

Paper Overview

Mixture-of-Depths Attention (MoDA) proposes a novel attention mechanism that mitigates information dilution in deep large‑language models. The pre‑print is available at https://arxiv.org/pdf/2603.15619 and the reference implementation at https://github.com/hustvl/MoDA.

Key Innovations

Flash-compatible depth-KV layout: cross-layer KV pairs of size T×L (sequence length × depth) are flattened into a contiguous memory block, so they can be read block-wise in the same way FlashAttention kernels read sequence KV.

Fusion of sequence and depth attention: a single forward pass shares the online softmax state, eliminating intermediate storage. The fused kernel attains 97.3% of FlashAttention‑2 throughput at a 64K sequence length while preserving numerical precision.
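The idea of sharing one online softmax state across both KV sources can be illustrated with a minimal pure-Python sketch. It is not the authors' kernel: the function names, the block size, and the scaled dot-product scoring are assumptions made here for illustration. It shows only that carrying a single running (max, denominator, accumulator) triple through the sequence blocks and then the depth blocks gives the same result as one softmax over all keys at once, without storing intermediate scores.

```python
import math


def online_softmax_update(state, scores, values):
    """Fold one block of (score, value) pairs into the running softmax state."""
    if not scores:
        return state
    running_max, denom, acc = state
    new_max = max(running_max, max(scores))
    # Rescale everything accumulated under the old maximum.
    scale = math.exp(running_max - new_max)
    denom *= scale
    acc = [a * scale for a in acc]
    for s, v in zip(scores, values):
        w = math.exp(s - new_max)
        denom += w
        acc = [a + w * x for a, x in zip(acc, v)]
    return new_max, denom, acc


def scaled_dot(q, k):
    """Scaled dot-product score (an assumed, standard scoring function)."""
    return sum(x * y for x, y in zip(q, k)) / math.sqrt(len(q))


def fused_attention(q, seq_k, seq_v, depth_k, depth_v, block=2):
    """One pass over sequence KV, then depth KV, sharing a single state."""
    state = (-math.inf, 0.0, [0.0] * len(seq_v[0]))
    for keys, vals in ((seq_k, seq_v), (depth_k, depth_v)):
        for i in range(0, len(keys), block):
            scores = [scaled_dot(q, k) for k in keys[i:i + block]]
            state = online_softmax_update(state, scores, vals[i:i + block])
    _, denom, acc = state
    return [a / denom for a in acc]


def joint_softmax_attention(q, keys, vals):
    """Reference: one softmax over all keys, materializing every score."""
    scores = [scaled_dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, vals)) / total
            for j in range(len(vals[0]))]
```

Calling `fused_attention(q, seq_k, seq_v, depth_k, depth_v)` and `joint_softmax_attention(q, seq_k + depth_k, seq_v + depth_v)` on the same inputs should agree up to floating-point rounding.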

Methodology

MoDA extends the standard Transformer decoder by allowing each query to attend simultaneously to (i) the current layer’s sequence KV and (ii) the depth KV accumulated from all preceding layers. A single softmax jointly normalizes both contributions, and the same projection parameters are reused across layers to improve optimization efficiency. The paper provides asymptotic complexity analysis showing MoDA’s overhead grows linearly with depth L and is comparable to existing depth‑flow methods.
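One way to write the joint normalization described above is sketched below. The notation is introduced here for illustration and is not taken from the paper: q is a single query, S is the set of sequence KV entries visible to it in the current layer, D is the set of depth KV entries gathered from preceding layers (marked with tildes), and the 1/sqrt(d) scaling is the standard one, assumed rather than confirmed.

```latex
\[
o \;=\;
\frac{\sum_{j \in \mathcal{S}} \exp\!\left(q^{\top} k_j / \sqrt{d}\right) v_j
      \;+\; \sum_{m \in \mathcal{D}} \exp\!\left(q^{\top} \tilde{k}_m / \sqrt{d}\right) \tilde{v}_m}
     {\sum_{j \in \mathcal{S}} \exp\!\left(q^{\top} k_j / \sqrt{d}\right)
      \;+\; \sum_{m \in \mathcal{D}} \exp\!\left(q^{\top} \tilde{k}_m / \sqrt{d}\right)}
\]
```

Because the denominator is shared, sequence and depth entries compete for the same probability mass instead of being normalized separately and then mixed.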

To achieve hardware efficiency, the authors design:

Flash‑compatible KV flattening that stores depth KV contiguously.

Block‑aware and group‑aware indexing schemes that map the flattened layout to the FlashAttention‑2 kernel.

A fused kernel that reuses the online softmax buffer, avoiding extra memory traffic.
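A toy sketch of the flattening step might look like the following. The exact ordering used by the reference implementation is not given in this article, so the token-major layout here (each token's L depth entries stored next to each other) and every function name are hypothetical. The point is only that, under such a layout, the depth KV belonging to a block of consecutive tokens occupies one contiguous range of rows, which is what block-wise reads need.

```python
def flatten_depth_kv(depth_kv):
    """Flatten depth_kv[t][layer] into a single list of rows (token-major)."""
    return [vec for per_token in depth_kv for vec in per_token]


def flat_index(token, layer, num_layers):
    """Row of (token, layer) in the flattened, token-major layout."""
    return token * num_layers + layer


def block_row_range(block_id, block_size, num_layers, seq_len):
    """Half-open row range holding the depth KV of one block of tokens."""
    first_token = block_id * block_size
    last_token = min(first_token + block_size, seq_len)
    return first_token * num_layers, last_token * num_layers


# Tiny example: T = 5 tokens, L = 3 layers, one-element vectors tagged (t, l).
T, L = 5, 3
depth_kv = [[[(t, l)] for l in range(L)] for t in range(T)]
flat = flatten_depth_kv(depth_kv)
start, stop = block_row_range(block_id=1, block_size=2, num_layers=L, seq_len=T)
rows_for_block = flat[start:stop]  # depth KV for tokens 2 and 3, all layers
```

A layer-major ordering would be equally easy to index, but it would scatter one token block's depth entries across L separate ranges, which is the kind of access pattern a contiguous layout avoids.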

Architecture Illustration

The paper's architecture diagram shows how a query vector can access both sequence KV and depth KV across layers, so that shallow-layer features remain available in deeper layers.

Performance Evaluation (1.5B-parameter Models)

Experiments use a 1.5B‑parameter decoder trained with the OLMo2 recipe. Compared to the OLMo2 baseline, MoDA consistently achieves lower C4 validation loss and higher accuracy on HellaSwag, WinoGrande, and ARC‑Challenge as the number of training tokens increases.

Conceptual Comparison of Depth‑Flow Mechanisms

The paper's comparison figure contrasts four strategies: deep residual, deep dense, deep attention, and MoDA. Of the four, only MoDA combines sequence and depth KV, enabling cross-layer information aggregation without prohibitive computational cost.

Efficiency Experiments

Benchmarks were run on an NVIDIA A100 GPU in bfloat16 mode (batch size B=1, head dimension d=64, block size C=64). The following variables were varied:

Sequence length T: increasing T from 4K to 64K reduces MoDA's overhead from 25.86% to 2.73%.

GQA group size G: increasing G from 2 to 32 raises depth utilization from 3.12% to 50% and cuts overhead from 27.07% to 2.84%.

Model depth L: increasing L from 64 to 256 increases runtime linearly, with overhead growing from 8.59% to 30.52%.

These results demonstrate that MoDA scales predictably and remains close to FlashAttention‑2 efficiency in long‑sequence, high‑utilization regimes.
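The reported figures can be cross-checked with a few lines of arithmetic. Two assumptions are made here that the article does not state explicitly: that depth utilization equals G/C (group size over block size, which matches the reported values for C=64), and that "overhead" means extra runtime relative to FlashAttention-2, so relative throughput is 1/(1 + overhead). The helper names are hypothetical.

```python
def depth_utilization(group_size, block_size):
    """Assumed definition: fraction G / C of each block used by depth KV."""
    return group_size / block_size


def relative_throughput_pct(overhead_pct):
    """Throughput relative to the baseline, assuming runtime = (1 + overhead) x baseline."""
    return 100.0 / (1.0 + overhead_pct / 100.0)


def overhead_slope_per_layer(l_low, ov_low, l_high, ov_high):
    """Average overhead increase (percentage points) per added layer."""
    return (ov_high - ov_low) / (l_high - l_low)


low_util = depth_utilization(2, 64)       # compare with the reported 3.12%
high_util = depth_utilization(32, 64)     # compare with the reported 50%
long_seq = relative_throughput_pct(2.73)  # compare with the reported 97.3%
slope = overhead_slope_per_layer(64, 8.59, 256, 30.52)
```

Under these assumptions, the 2.73% overhead at the longest sequence length is consistent with the 97.3% throughput figure quoted earlier, and the depth sweep corresponds to roughly 0.11 percentage points of additional overhead per layer.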
