How Mixture-of-Depths Attention Boosts Large Language Model Efficiency
This article examines the Mixture‑of‑Depths Attention (MoDA) mechanism, detailing its novel flash‑compatible KV layout, combined sequence‑depth attention, theoretical analysis, and extensive experiments that show significant reductions in validation loss and accuracy gains on downstream tasks compared to the OLMo2 baseline.
Paper Overview
Mixture-of-Depths Attention (MoDA) is an attention mechanism designed to mitigate information dilution in deep large language models. The pre‑print is available at https://arxiv.org/pdf/2603.15619 and the reference implementation at https://github.com/hustvl/MoDA.
Key Innovations
Flash‑compatible depth‑KV layout: cross‑layer KV pairs of size T×L are flattened into a contiguous memory block, so they can be read block by block in the same way FlashAttention kernels read sequence KV.
Fusion of sequence and depth attention: a single forward pass shares the online softmax state, eliminating intermediate storage. The fused kernel attains 97.3% of FlashAttention‑2 throughput at a 64K sequence length while preserving numerical precision.
Methodology
MoDA extends the standard Transformer decoder by allowing each query to attend simultaneously to (i) the current layer’s sequence KV and (ii) the depth KV accumulated from all preceding layers. A single softmax jointly normalizes both contributions, and the same projection parameters are reused across layers to improve optimization efficiency. The paper provides asymptotic complexity analysis showing MoDA’s overhead grows linearly with depth L and is comparable to existing depth‑flow methods.
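The joint normalization described above can be illustrated with a minimal sketch for a single query. It is not the authors' implementation: the function name `joint_attention`, the plain-list representation, and the 1/sqrt(d) scaling are assumptions made for illustration. The point it shows is that sequence keys and depth keys compete inside one softmax rather than being normalized separately.

```python
import math

def joint_attention(q, seq_k, seq_v, depth_k, depth_v):
    """Attend one query over sequence KV and depth KV with a single softmax.

    Hypothetical helper for illustration; all inputs are lists of floats
    (q) or lists of equal-length float lists (keys and values).
    """
    keys = seq_k + depth_k        # concatenate both key sets
    values = seq_v + depth_v      # values follow the same order
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)               # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)              # one normalizer shared by both sources
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / z
            for j in range(dim)]

# With all-zero keys every position gets equal weight, so the output
# is the plain average of the sequence value and the depth value.
out = joint_attention([1.0, 0.0],
                      seq_k=[[0.0, 0.0]], seq_v=[[2.0, 0.0]],
                      depth_k=[[0.0, 0.0]], depth_v=[[0.0, 4.0]])
print(out)
```

Because the normalizer is shared, a strongly matching depth key reduces the weight given to sequence keys, and vice versa.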
To achieve hardware efficiency, the authors design:
Flash‑compatible KV flattening that stores depth KV contiguously.
Block‑aware and group‑aware indexing schemes that map the flattened layout to the FlashAttention‑2 kernel.
A fused kernel that reuses the online softmax buffer, avoiding extra memory traffic.
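The idea behind sharing the online softmax state can be sketched in plain Python. This is a simplified single-query illustration, not the fused GPU kernel itself: the function name `online_softmax_attention` and the block representation are assumptions. It streams over KV blocks while carrying a running maximum, normalizer, and accumulator, so sequence blocks and depth blocks can be processed one after another without storing intermediate attention scores.

```python
import math

def online_softmax_attention(q, kv_blocks):
    """Stream over non-empty (keys, values) blocks with online softmax.

    Hypothetical helper for illustration; each block is a pair of lists.
    """
    scale = 1.0 / math.sqrt(len(q))
    m = float("-inf")   # running maximum of all scores seen so far
    z = 0.0             # running softmax normalizer
    acc = None          # running unnormalized weighted sum of values
    for keys, values in kv_blocks:
        if acc is None:
            acc = [0.0] * len(values[0])
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        m_new = max(m, max(scores))
        correction = math.exp(m - m_new)   # rescales earlier blocks (0.0 on the first)
        probs = [math.exp(s - m_new) for s in scores]
        z = z * correction + sum(probs)
        acc = [a * correction + sum(p * v[j] for p, v in zip(probs, values))
               for j, a in enumerate(acc)]
        m = m_new
    return [a / z for a in acc]

q = [0.5, -1.0]
seq_block = ([[1.0, 2.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
depth_block = ([[-1.0, 0.5]], [[2.0, 2.0]])

# Streaming sequence then depth matches one softmax over the concatenation.
streamed = online_softmax_attention(q, [seq_block, depth_block])
joint = online_softmax_attention(
    q, [(seq_block[0] + depth_block[0], seq_block[1] + depth_block[1])])
print(streamed, joint)
```

The two results agree up to floating-point rounding, which is why a single pass can cover both KV sources with one set of running statistics.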
Architecture Illustration
The diagram below shows how a query vector can access both sequence KV and depth KV across layers, preserving shallow features in deeper layers.
Performance Evaluation (1.5B‑parameter Models)
Experiments use a 1.5B‑parameter decoder trained with the OLMo2 recipe. Compared to the OLMo2 baseline, MoDA consistently achieves lower C4 validation loss and higher accuracy on HellaSwag, WinoGrande, and ARC‑Challenge as the number of training tokens increases.
Conceptual Comparison of Depth‑Flow Mechanisms
The figure contrasts four strategies: deep residual, deep dense, deep attention, and MoDA. MoDA uniquely combines sequence and depth KV, enabling cross‑layer information aggregation without prohibitive computational cost.
Efficiency Experiments
Benchmarks were run on an NVIDIA A100 GPU in bfloat16 mode (batch size B=1, head dimension d=64, block size C=64). The following variables were varied:
Sequence length T: increasing T from 4K to 64K (65,536 tokens) reduces MoDA’s overhead from 25.86% to 2.73%.
GQA group size G: increasing G from 2 to 32 raises depth utilization from 3.12% to 50% and cuts overhead from 27.07% to 2.84%.
Model depth L: increasing L from 64 to 256 makes runtime grow linearly, with overhead rising from 8.59% to 30.52%.
These results demonstrate that MoDA scales predictably and remains close to FlashAttention‑2 efficiency in long‑sequence, high‑utilization regimes.
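The reported figures can be cross-checked with a little arithmetic. The sketch below rests on two assumptions that the summary does not state explicitly: that depth utilization equals G divided by the block size C, and that "overhead" means extra runtime relative to FlashAttention‑2, so relative throughput is roughly 1 / (1 + overhead). Both function names are hypothetical.

```python
def depth_utilization(group_size, block_size):
    """Assumed definition: fraction of a block occupied by one GQA group."""
    return group_size / block_size

def relative_throughput(overhead_pct):
    """Assumed relation: throughput relative to the baseline given runtime overhead."""
    return 1.0 / (1.0 + overhead_pct / 100.0)

# Block size C=64 as in the benchmark setup.
print(depth_utilization(2, 64))    # 0.03125, in line with the reported 3.12%
print(depth_utilization(32, 64))   # 0.5, in line with the reported 50%
print(round(relative_throughput(2.73), 3))  # 0.973, in line with 97.3% at 64K
```

Under these assumptions, the 2.73% overhead at the longest sequence length and the 97.3% throughput figure quoted earlier describe the same measurement.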
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
