How Mixture-of-Depths Attention Boosts Large Language Model Efficiency

This article examines the Mixture-of-Depths Attention (MoDA) mechanism: its flash-compatible KV layout, its combined sequence-depth attention, the accompanying theoretical analysis, and experiments that show lower validation loss and higher downstream-task accuracy than the OLMo2 baseline.

Data Party THU

Paper Overview

Mixture-of-Depths Attention (MoDA) is an attention mechanism designed to mitigate information dilution in deep large language models, where features from shallow layers fade as depth grows. The preprint is available at https://arxiv.org/pdf/2603.15619 and the reference implementation at https://github.com/hustvl/MoDA.

Key Innovations

Flash-compatible depth-KV layout: the T×L cross-layer KV pairs (sequence length T, depth L) are flattened into one contiguous memory block, so they can be read block by block in the same way FlashAttention kernels read sequence KV.

Fusion of sequence and depth attention: a single forward pass shares the online softmax state, eliminating intermediate storage. The fused kernel attains 97.3% of FlashAttention‑2 throughput at a 64K sequence length while preserving numerical precision.
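As a rough illustration of the flattened layout, the sketch below maps a (token, layer) pair to an offset in a single flat buffer. The token-major ordering (all layers of one token stored side by side), the helper names `flat_index` and `depth_slice`, and the toy sizes are assumptions made for illustration; the layout used in the paper and its kernels may differ.

```python
def flat_index(t: int, layer: int, num_layers: int) -> int:
    """Offset of the depth-KV entry for token t at the given layer.

    Assumes token-major ordering: the entries of one token are adjacent.
    """
    return t * num_layers + layer


def depth_slice(t: int, upto_layer: int, num_layers: int) -> range:
    """Offsets of token t's depth entries from layers 0 .. upto_layer - 1."""
    start = flat_index(t, 0, num_layers)
    return range(start, start + upto_layer)


T, L = 4, 3  # toy sequence length and depth

# Fill the flat buffer with (token, layer) tags standing in for KV pairs.
buffer = [None] * (T * L)
for t in range(T):
    for layer in range(L):
        buffer[flat_index(t, layer, L)] = (t, layer)

# A query for token 2 at layer 2 reads one contiguous run of the buffer.
entries = [buffer[i] for i in depth_slice(2, 2, L)]
```

Under this assumed ordering, the depth entries a query needs form a single contiguous range, which is the property that makes block-wise loads possible.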

Methodology

MoDA extends the standard Transformer decoder by allowing each query to attend simultaneously to (i) the current layer’s sequence KV and (ii) the depth KV accumulated from all preceding layers. A single softmax jointly normalizes both contributions, and the same projection parameters are reused across layers to improve optimization efficiency. The paper provides asymptotic complexity analysis showing MoDA’s overhead grows linearly with depth L and is comparable to existing depth‑flow methods.
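The joint normalization described above can be sketched for a single query in plain Python. This is a minimal illustration, not the reference implementation: the function name `joint_attention`, the scaled dot-product scoring, and the toy vectors are assumptions, and masking and multi-head details are omitted.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def joint_attention(q, seq_kv, depth_kv):
    """Attend over sequence KV and depth KV with one shared softmax.

    q: query vector; seq_kv and depth_kv: lists of (key, value) pairs.
    Returns the output vector and the attention probabilities.
    """
    kv = list(seq_kv) + list(depth_kv)  # one candidate set, one softmax
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k, _ in kv]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    probs = [w / total for w in weights]
    dim = len(kv[0][1])
    out = [sum(p * v[i] for p, (_, v) in zip(probs, kv)) for i in range(dim)]
    return out, probs


q = [1.0, 0.0]
seq_kv = [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0])]
depth_kv = [([1.0, 0.0], [2.0, 2.0])]  # stand-in for an earlier layer's KV
out, probs = joint_attention(q, seq_kv, depth_kv)
```

Because both sources share one denominator, a depth entry competes directly with sequence entries for probability mass instead of being mixed in through a separate gate.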

To achieve hardware efficiency, the authors design:

Flash‑compatible KV flattening that stores depth KV contiguously.

Block‑aware and group‑aware indexing schemes that map the flattened layout to the FlashAttention‑2 kernel.

A fused kernel that reuses the online softmax buffer, avoiding extra memory traffic.
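The idea behind reusing the online softmax state can be shown with the standard online-softmax recurrence: the sequence block and the depth block are processed one after the other while only a running maximum, a running normalizer, and an unnormalized output are kept. The sketch below is a scalar Python illustration with hypothetical names and precomputed scores, not the fused GPU kernel.

```python
import math


def online_attention(blocks):
    """Softmax-weighted sum of values, accumulated block by block.

    blocks: list of blocks, each a list of (score, value_vector) pairs.
    """
    running_max = -math.inf
    denom = 0.0
    acc = None
    for block in blocks:
        if not block:
            continue
        if acc is None:
            acc = [0.0] * len(block[0][1])
        new_max = max(running_max, max(s for s, _ in block))
        # Rescale everything accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc = [a * correction for a in acc]
        for score, value in block:
            w = math.exp(score - new_max)
            denom += w
            acc = [a + w * v for a, v in zip(acc, value)]
        running_max = new_max
    return [a / denom for a in acc]


def reference_attention(pairs):
    """Same result computed in one pass over all pairs, for comparison."""
    top = max(s for s, _ in pairs)
    weights = [math.exp(s - top) for s, _ in pairs]
    total = sum(weights)
    dim = len(pairs[0][1])
    return [sum(w * v[i] for w, (_, v) in zip(weights, pairs)) / total
            for i in range(dim)]


seq_block = [(0.5, [1.0, 0.0]), (2.0, [0.0, 1.0])]
depth_block = [(3.0, [2.0, 2.0]), (-1.0, [1.0, 1.0])]
fused = online_attention([seq_block, depth_block])
reference = reference_attention(seq_block + depth_block)
```

The two results agree up to floating-point rounding, which is why the depth block can be appended to the sequence pass without materializing intermediate attention scores.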

Architecture Illustration

The diagram below shows how a query vector can access both sequence KV and depth KV across layers, preserving shallow features in deeper layers.

MoDA visibility diagram

Performance Evaluation (1.5B-parameter Models)

Experiments use a 1.5B‑parameter decoder trained with the OLMo2 recipe. Compared to the OLMo2 baseline, MoDA consistently achieves lower C4 validation loss and higher accuracy on HellaSwag, WinoGrande, and ARC‑Challenge as the number of training tokens increases.

MoDA vs OLMo2 performance curves

Conceptual Comparison of Depth‑Flow Mechanisms

The figure contrasts four strategies: deep residual, deep dense, deep attention, and MoDA. MoDA uniquely combines sequence and depth KV, enabling cross‑layer information aggregation without prohibitive computational cost.

Depth flow mechanisms comparison

Efficiency Experiments

Benchmarks were run on an NVIDIA A100 GPU in bfloat16 with batch size B=1, head dimension d=64, and block size C=64. Three variables were swept:

Sequence length T: increasing T from 4K to 64K reduces MoDA's overhead from 25.86% to 2.73%.

GQA group size G: increasing G from 2 to 32 raises depth utilization from 3.12% to 50% and cuts overhead from 27.07% to 2.84%.

Model depth L: increasing L from 64 to 256 increases runtime linearly, with overhead growing from 8.59% to 30.52%.

These results demonstrate that MoDA scales predictably and remains close to FlashAttention‑2 efficiency in long‑sequence, high‑utilization regimes.
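The reported depth-utilization figures are consistent with the ratio of GQA group size to block size, G/C, with C=64. That relation is inferred here from the numbers rather than quoted from the paper, as is the conversion from runtime overhead to relative throughput; the function names are hypothetical.

```python
def depth_utilization(group_size: int, block_size: int) -> float:
    """Assumed relation: fraction of a depth block filled by one GQA group."""
    return group_size / block_size


def relative_throughput(overhead: float) -> float:
    """Throughput relative to the baseline, assuming overhead is extra runtime."""
    return 1.0 / (1.0 + overhead)


C = 64
low = depth_utilization(2, C)    # 0.03125, reported as 3.12%
high = depth_utilization(32, C)  # 0.5, reported as 50%

# An overhead of 2.73% corresponds to roughly 97.3% of baseline throughput.
long_seq = relative_throughput(0.0273)
```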

