How InfLLM‑V2 Delivers Fast, Low‑Cost Sparse Attention for Long‑Context LLMs

InfLLM‑V2 introduces a zero‑parameter, training‑efficient sparse‑attention framework that dramatically speeds up long‑sequence processing while requiring only 5B long‑text tokens for adaptation, and the open‑source MiniCPM4.1 model built on it performs comparably to dense attention on both long‑text understanding and deep‑thinking benchmarks.

1. Introduction

Long‑sequence processing is a major bottleneck for large‑language‑model (LLM) applications because dense self‑attention scales quadratically with sequence length. InfLLM‑V2, proposed by Tsinghua University and OpenBMB, is a native sparse‑attention framework that introduces no extra trainable parameters and can be trained efficiently on a modest amount of long‑text data.

2. Sparse‑attention mechanism

Standard Transformer attention computes the similarity between each query token Q[t] and all previous key tokens K[:t], which becomes infeasible for contexts of hundreds of thousands of tokens. Empirical analysis shows that most distant attention scores are near zero, indicating intrinsic sparsity. InfLLM‑V2 exploits this sparsity in two stages:

Block selection: The context is partitioned into fixed‑size key‑value blocks. For each query, a parameter‑free pooling operation produces a relevance score for every block, and a Top‑K operation selects a small subset of blocks to attend to.

Sparse attention computation: Attention is computed only over the selected blocks; the remaining blocks are skipped entirely (see the sketch after this list).
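A minimal PyTorch sketch of this two‑stage idea, assuming a single query step and mean pooling as the parameter‑free scorer; `block_size`, `top_k`, and the tensor layout are illustrative choices, not the paper's fused kernel implementation:

```python
import torch
import torch.nn.functional as F

def sparse_attention_sketch(q, k, v, block_size=64, top_k=8):
    # q: (heads, d) for a single query step; k, v: (heads, T, d) past context.
    H, T, d = k.shape
    n_blocks = (T + block_size - 1) // block_size

    # Stage 1: parameter-free block scoring via mean pooling over each block's keys.
    pad = n_blocks * block_size - T
    k_pad = F.pad(k, (0, 0, 0, pad))                                  # (H, n_blocks*block_size, d)
    block_keys = k_pad.view(H, n_blocks, block_size, d).mean(dim=2)   # (H, n_blocks, d)
    block_scores = torch.einsum("hd,hbd->hb", q, block_keys)          # (H, n_blocks)
    top_blocks = block_scores.topk(min(top_k, n_blocks), dim=-1).indices

    # Stage 2: attend only inside the selected blocks; all other blocks are skipped.
    out = torch.zeros_like(q)
    for h in range(H):
        cols = torch.cat([torch.arange(b * block_size, min((b + 1) * block_size, T))
                          for b in top_blocks[h].tolist()])
        att = torch.softmax(q[h] @ k[h, cols].T / d ** 0.5, dim=-1)
        out[h] = att @ v[h, cols]
    return out
```

In the actual system the query side is also processed blockwise and selection and attention are fused into optimized kernels; the sketch only illustrates the select-then-attend structure.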

This design enables training with only 5B long‑text tokens, whereas prior sparse‑attention approaches such as DeepSeek‑V3.2‑Exp required close to 1T tokens to achieve comparable capability.

3. Core advantages

Low‑cost training: The 5B‑token requirement dramatically reduces data collection and compute expense.

Seamless short‑to‑long switching: Dense attention is used for short sequences; the same model automatically switches to the sparse path for long sequences without adding parameters (see the sketch after this list).

Hardware‑friendly implementation: Optimized block‑selection kernels minimise HBM I/O and arithmetic, fully unlocking the speed potential of sparse attention.
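The switching behaviour can be pictured as a length check in front of the two attention paths; a hedged sketch in which `DENSE_THRESHOLD` is an assumed cut‑over length and `sparse_attention_sketch` refers to the earlier illustration:

```python
import torch

DENSE_THRESHOLD = 8192  # assumed cut-over length, not a value taken from the paper

def attend(q, k, v, block_size=64, top_k=8):
    # q: (heads, d); k, v: (heads, T, d), same layout as the earlier sketch.
    T = k.shape[-2]
    if T <= DENSE_THRESHOLD:
        # Short sequences: plain dense attention over the full context.
        scores = torch.einsum("hd,htd->ht", q, k) / q.shape[-1] ** 0.5
        return torch.einsum("ht,htd->hd", torch.softmax(scores, dim=-1), v)
    # Long sequences: the same weights and the same single KV cache feed the sparse path.
    return sparse_attention_sketch(q, k, v, block_size=block_size, top_k=top_k)
```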

4. Comparison with DeepSeek NSA

DeepSeek’s NSA architecture introduces three independent KV caches and three parallel attention branches, which cause instability during long‑sequence fine‑tuning and add overhead for short‑sequence tasks. InfLLM‑V2 replaces this with a single shared KV cache and a parameter‑free sparse path, aligning dense and sparse computations.

Key innovations:

Zero‑parameter block selection: Instead of an MLP‑based block compressor, InfLLM‑V2 uses a simple pooling operation to generate block scores, followed by a Top‑K selection.

Grouped‑query attention (GQA) shared Top‑K: GQA fuses query heads within a group, so a single Top‑K operation can be shared across each group, which reduces kernel launch overhead (sketched below).
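A rough sketch of the group‑shared selection, assuming query heads are laid out group by group and per‑head block scores are pooled by a simple mean; the kernel‑level fusion in the real implementation may differ:

```python
import torch

def shared_topk_blocks(block_scores, n_groups, top_k=8):
    # block_scores: (n_query_heads, n_blocks), with query heads ordered group by group.
    H, B = block_scores.shape
    grouped = block_scores.view(n_groups, H // n_groups, B)
    group_scores = grouped.mean(dim=1)          # pool the per-head scores within each group
    # One Top-K per group instead of one per head, so every head in a group
    # attends to the same set of selected key-value blocks.
    return group_scores.topk(min(top_k, B), dim=-1).indices   # (n_groups, top_k)
```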

5. Experimental results

5.1 Long‑text understanding

On the RULER, LongBench and LongPPL benchmarks, InfLLM‑V2 attains 98.1 % of the performance of a dense‑attention baseline while other sparse methods suffer noticeable drops.

5.2 Deep‑thinking tasks

For mathematical and code‑reasoning benchmarks, InfLLM‑V2 matches dense‑attention quality (≈99.7 % of dense performance), whereas NSA‑based models exhibit a large degradation.

5.3 Efficiency evaluation

Inference on NVIDIA A100 and RTX 4090 GPUs shows a 4–9× operator‑level speed‑up for 128K‑token sequences. End‑to‑end measurements report roughly 2.1× speed‑up during prefill and 2.3× during decode, confirming that the efficient block‑selection design is the primary acceleration source.

6. First open‑source native sparse‑attention model

InfLLM‑V2 was used to train MiniCPM4 (June 2025) and its improved variant MiniCPM4.1 (September 2025). MiniCPM4.1 achieves the highest average score among open‑source models of the same size on deep‑thinking benchmarks and runs roughly three times faster than comparable models such as Qwen3‑8B on tasks like LiveCodeBench and AIME, thanks to sparse attention and speculative sampling.

7. Future work

Planned directions include further optimisation of training and inference kernels, integration of InfLLM‑V2 into mainstream inference frameworks (e.g., SGLang), and open‑sourcing of the base model and long‑text training data to encourage broader research on sparse‑attention mechanisms.

Paper: https://arxiv.org/abs/2509.24663

Model repository: https://huggingface.co/openbmb/MiniCPM4.1-8B
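A hedged usage sketch for loading the released checkpoint with Hugging Face `transformers`; the exact chat template, generation settings, and remote‑code requirements should be checked against the model card:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openbmb/MiniCPM4.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

prompt = "Summarize the key idea behind block-sparse attention."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```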

[Figure: InfLLM‑V2 overview diagram]
[Figure: Sparse attention block selection]
[Figure: Comparison of NSA and InfLLM‑V2]
Written by Data Party THU, the official platform of the Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.