A Comprehensive Guide to Major Attention Mechanisms: From MHA and GQA to MLA, Sparse and Hybrid Architectures
This article reviews and compares the most important attention variants used in modern large language models—including multi‑head attention, grouped‑query attention, multi‑head latent attention, sparse and sliding‑window attention, gated attention, and hybrid designs—detailing their motivations, memory trade‑offs, example architectures, and experimental findings.
Introduction
Renowned AI writer Sebastian Raschka recently published a visual guide to attention variants in large language models (LLMs). The guide, which has attracted significant community attention, aims to serve both as a reference and a lightweight learning resource.
1. Multi‑Head Attention (MHA)
Self‑attention allows each token to attend to all other visible tokens, assigning weights to build context‑aware representations. MHA implements this in Transformers by running multiple self‑attention heads in parallel, each with its own learned projection, and then concatenating the results.
Historical background: attention predates Transformers, originating in encoder‑decoder RNNs for machine translation. An RNN's fixed‑size hidden state cannot store unlimited context, creating a bottleneck that attention resolves by giving the decoder direct access to the entire input sequence.
Masking: In causal (decoder‑only) models, future tokens are masked, producing a triangular mask in the attention matrix.
Self‑attention internals:
Weight matrices W_q, W_k, W_v project input embeddings to queries, keys, and values.
Compute raw scores with QK^T, scaled by √d_k (the per‑head key dimension).
Apply softmax to obtain the normalized attention matrix A.
Multiply A by V to produce the output matrix Z.
The single‑head pipeline (query‑key‑value projection → scaled dot‑product → softmax → weighted sum) is illustrated in Figure 7.
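To make the pipeline concrete, here is a minimal PyTorch sketch of causal multi‑head self‑attention. The module and dimension names (d_model, n_heads) are illustrative choices, not taken from the original article:

```python
# Minimal sketch of causal multi-head self-attention (PyTorch).
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        # W_q, W_k, W_v project input embeddings to queries, keys, values
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        # split into heads: (b, n_heads, t, d_head)
        q = self.W_q(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.W_k(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = self.W_v(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        # raw scores QK^T, scaled by sqrt(d_head)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        # causal (triangular) mask: a token sees only itself and earlier tokens
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
        A = torch.softmax(scores, dim=-1)              # normalized attention matrix A
        z = (A @ v).transpose(1, 2).reshape(b, t, -1)  # weighted sum of values -> Z
        return self.W_o(z)
```

Each head runs the same query‑key‑value pipeline in parallel with its own slice of the projections; the heads are concatenated and mixed by the final output projection.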
2. Grouped‑Query Attention (GQA)
Proposed by Joshua Ainslie et al. (2023), GQA divides the query heads into groups, with each group sharing a single key‑value head. This reduces KV‑cache memory while largely preserving MHA's attention pattern (a minimal sketch follows the list of benefits below).
Benefits:
Lower memory cost because fewer KV heads are stored per layer.
Minimal implementation changes compared with MHA.
Typical configurations keep a moderate number of KV groups (more than the single shared head of pure multi‑query attention, but far fewer KV heads than full MHA) to balance memory savings and modeling quality, as shown in the comparison between the 30B and 105B Sarvam models (Figure 12).
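A minimal sketch of the grouping trick, assuming n_heads is a multiple of n_kv_heads (both names are illustrative):

```python
# Grouped-query attention sketch: n_kv_heads < n_heads, and each KV head
# is shared by a group of query heads, shrinking the KV cache.
import math
import torch
import torch.nn as nn

class GroupedQueryAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.d_head = d_model // n_heads
        self.W_q = nn.Linear(d_model, n_heads * self.d_head, bias=False)
        # fewer KV projections -> fewer KV heads stored per layer
        self.W_k = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.W_v = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.W_o = nn.Linear(n_heads * self.d_head, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.W_q(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.W_k(x).view(b, t, self.n_kv_heads, self.d_head).transpose(1, 2)
        v = self.W_v(x).view(b, t, self.n_kv_heads, self.d_head).transpose(1, 2)
        # replicate each KV head across its query group
        g = self.n_heads // self.n_kv_heads
        k, v = k.repeat_interleave(g, dim=1), v.repeat_interleave(g, dim=1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        A = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
        return self.W_o((A @ v).transpose(1, 2).reshape(b, t, -1))
```

With, say, n_heads = 8 and n_kv_heads = 2, the KV cache shrinks fourfold relative to MHA, since only two key and two value heads are stored per layer.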
3. Multi‑Head Latent Attention (MLA)
MLA, introduced in the DeepSeek‑V2 paper, compresses the stored KV representation into a latent form, achieving greater memory reduction than GQA at the cost of added implementation complexity.
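The core compression step can be sketched in a few lines. This is a simplified reconstruction of the idea, not DeepSeek's implementation: real MLA also routes RoPE through a decoupled path and can absorb the up‑projections into the attention computation, both omitted here; d_latent is an illustrative name.

```python
# Simplified MLA-style KV compression: cache one low-rank latent per token
# instead of full keys and values.
import torch
import torch.nn as nn

class LatentKVCompression(nn.Module):
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.W_down = nn.Linear(d_model, d_latent, bias=False)  # compress to latent
        self.W_up_k = nn.Linear(d_latent, d_model, bias=False)  # reconstruct keys
        self.W_up_v = nn.Linear(d_latent, d_model, bias=False)  # reconstruct values

    def forward(self, h: torch.Tensor):
        c_kv = self.W_down(h)   # (b, t, d_latent): only this tensor is cached
        k = self.W_up_k(c_kv)   # keys recovered on the fly at attention time
        v = self.W_up_v(c_kv)   # values recovered on the fly at attention time
        return c_kv, k, v
```

Per token and layer, the cache now holds d_latent values instead of the 2·d_model needed for full keys and values, which is where the extra memory saving over GQA comes from.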
Ablation results (DeepSeek‑V2) show MLA can match or surpass MHA performance when carefully tuned, while still offering substantial KV savings.
MLA has been adopted in later DeepSeek releases (V3, V3.2) and has since spread to models such as Kimi K2, GLM‑5, Ling 2.5, and the 105B Sarvam variant.
4. Sliding‑Window Attention (SWA)
SWA limits each token’s receptive field to a fixed local window, reducing quadratic memory and compute costs for long contexts. Models often mix local SWA layers with occasional global attention layers.
Gemma 3 serves as a clear example: it shifts the local‑to‑global layer ratio from 1:1 (Gemma 2) to 5:1 and reduces the window size from 4096 to 1024 tokens. Ablation studies indicate that more aggressive windowing has only a minor impact on perplexity.
Combining SWA with GQA is common because the two address different aspects of KV‑cache efficiency.
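The windowing itself is just a different attention mask. A small sketch, with an illustrative window size:

```python
# Sliding-window causal mask: token i attends only to positions in
# [i - window + 1, i].
import torch

def sliding_window_mask(t: int, window: int) -> torch.Tensor:
    i = torch.arange(t).unsqueeze(1)   # query positions (rows)
    j = torch.arange(t).unsqueeze(0)   # key positions (columns)
    # True where attention is allowed: causal AND inside the local window
    return (j <= i) & (j > i - window)

# Each row has at most `window` ones hugging the diagonal:
print(sliding_window_mask(6, 3).int())
```

Because each query only ever touches `window` keys, both the attention compute and the KV entries that must stay live scale with the window size rather than the full context length.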
5. DeepSeek Sparse Attention (DSA)
DSA, appearing in DeepSeek V3.2 and GLM‑5, replaces the fixed window of SWA with a learned sparse pattern. A fast “indexer” scores past tokens, and a selector keeps only the top‑k scores, forming a sparse mask.
This approach retains the benefit of limiting attention to a subset of past tokens while allowing the model to learn which tokens are most relevant.
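A conceptual sketch of the select‑then‑mask step, assuming the indexer has already produced a (t, t) matrix of relevance scores; the function name and the dense tensors are illustrative simplifications (a real implementation operates on the KV cache and avoids materializing full t × t scores for long contexts):

```python
# DSA-style sparse mask: keep only the top-k highest-scoring visible tokens
# for each query position.
import torch

def topk_sparse_mask(indexer_scores: torch.Tensor, k: int) -> torch.Tensor:
    """indexer_scores: (t, t) float scores from the cheap indexer."""
    t = indexer_scores.size(0)
    causal = torch.tril(torch.ones(t, t, dtype=torch.bool))
    scores = indexer_scores.masked_fill(~causal, float("-inf"))
    # indices of the k best-scoring visible tokens per query
    idx = scores.topk(min(k, t), dim=-1).indices
    mask = torch.zeros(t, t, dtype=torch.bool)
    mask.scatter_(-1, idx, True)
    return mask & causal  # never admit future tokens, even when k > i + 1
```

Full attention is then computed only over the pairs the mask keeps, so the cost per query is bounded by k rather than by the context length.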
6. Gated Attention
Gated attention modifies the standard full‑attention block with three additions: a sigmoid output gate that scales the attention output before it enters the residual stream, a zero‑centered variant of RMSNorm, and partial RoPE (rotary embeddings applied to only a fraction of the head dimensions). It appears in hybrid stacks, often alongside Gated DeltaNet, to provide more stable full‑attention layers without large computational overhead.
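Of the three changes, the output gate is the easiest to sketch. A minimal version, assuming a per‑element sigmoid gate computed from the layer input (W_gate is an illustrative name):

```python
# Output-gate sketch: the gate, computed from the layer input x, scales the
# attention output before the residual connection.
import torch
import torch.nn as nn

class GatedAttentionOutput(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.W_gate = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor, attn_out: torch.Tensor) -> torch.Tensor:
        gate = torch.sigmoid(self.W_gate(x))  # values in (0, 1)
        return x + gate * attn_out            # gated residual update
```

The gate lets the model attenuate an attention layer's contribution token by token, which is one plausible reason these layers train more stably in deep hybrid stacks.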
7. Hybrid Attention
Hybrid attention combines cheap linear or state‑space sequence modules (e.g., Mamba‑2, Lightning Attention, DeltaNet) for most layers with occasional full‑attention or gated‑attention layers, targeting long‑context efficiency.
Examples:
Qwen3‑Next mixes three Gated DeltaNet blocks with one gated‑attention block (a 3:1 ratio; see the layer‑stack sketch after this list).
Kimi Linear retains the 3:1 pattern but replaces Gated DeltaNet with a channel‑wise gated version and swaps the gated‑attention block for a gated MLA layer.
Ling 2.5 uses Lightning Attention for the lightweight part while keeping DeepSeek’s MLA for heavy layers.
Nemotron 3 Nano pushes the hybrid idea further by using Mamba‑2 for most sequence modeling and only a few self‑attention layers.
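The common thread in these designs is a fixed interleaving pattern. A toy sketch of a 3:1 stack layout, following the ratio described for Qwen3‑Next above (the layer‑type strings are placeholders, not real block classes):

```python
# Hybrid layer stack: every (ratio+1)-th layer is full attention,
# the rest are cheap linear-attention blocks.
def hybrid_layer_types(n_layers: int, ratio: int = 3) -> list[str]:
    return [
        "full_attention" if (i + 1) % (ratio + 1) == 0 else "linear_attention"
        for i in range(n_layers)
    ]

print(hybrid_layer_types(8))
# ['linear_attention', 'linear_attention', 'linear_attention', 'full_attention',
#  'linear_attention', 'linear_attention', 'linear_attention', 'full_attention']
```

Since only a quarter of the layers build a full KV cache, both long‑context memory and decoding compute are dominated by the cheap layers.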
Conclusion
The article focuses on attention variants that are currently deployed in open‑weight state‑of‑the‑art LLMs. It highlights the trade‑offs between memory efficiency and modeling performance, notes that hybrid architectures are gaining traction for long‑context tasks, and anticipates future developments such as Mamba‑3 layers and widespread use of attention residuals.