Stem Sparse Attention Cuts First-Token Latency by 3.6× for Long-Context LLMs
The article introduces Tencent Hunyuan's Stem sparse‑attention algorithm, which cuts first‑token latency by 3.6× at 128K context. Stem reallocates the attention compute budget using Token Position Decay and an Output‑Aware Metric, and HPC‑optimized kernels turn that saving into measured speedups over existing sparse‑attention kernels.
Introduction
When a document tens of thousands of words long is fed to a large language model, the prefill stage can cause a long wait before the first token appears. The delay comes from dense self‑attention, whose compute cost grows quadratically with sequence length.
Stem Algorithm: Rethinking Sparse Attention
Core insight: In causal attention, the initial token acts as the "trunk" of the information flow, influencing every later token. Existing sparse methods allocate the same budget to all positions, ignoring the asymmetric importance of early tokens.
The Stem algorithm introduces two innovations:
Token Position Decay (TPD): The budget decays linearly from the first position to the last, giving larger budgets to early tokens that carry critical recursive dependencies while aggressively pruning redundant information at later positions.
Output‑Aware Metric (OAM): Instead of selecting tokens solely by attention score (the query‑key dot product), OAM multiplies the routing probability by the magnitude of the Value vector, capturing each token's actual contribution to the output. Taking the logarithm turns the product into a sum, so a standard Top‑k operator can be used with near‑zero overhead.
These changes keep the total compute budget unchanged (≈25% of dense attention) while preserving near‑dense accuracy.
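The two ideas above can be sketched in a few lines of plain Python. This is a minimal illustration, not Hunyuan's implementation: the function names, the per‑query‑block keep ratio, and the 40% starting ratio are all assumptions made for the example. The only parts taken from the text are the linear decay, the roughly 25% overall budget, and the log‑domain score (attention logit plus the log of the Value norm).

```python
import math


def tpd_keep_ratios(num_blocks, mean_ratio=0.25, start_ratio=0.40):
    """Linearly decaying keep ratio per query block (hypothetical schedule)."""
    if num_blocks < 1:
        raise ValueError("num_blocks must be positive")
    # Pick the final ratio so the ramp's arithmetic mean equals mean_ratio.
    # Simplification: this ignores that later causal positions see more keys.
    end_ratio = 2.0 * mean_ratio - start_ratio
    if not 0.0 <= end_ratio <= start_ratio <= 1.0:
        raise ValueError("start_ratio is incompatible with mean_ratio")
    if num_blocks == 1:
        return [mean_ratio]
    step = (start_ratio - end_ratio) / (num_blocks - 1)
    return [start_ratio - i * step for i in range(num_blocks)]


def oam_scores(query, keys, values, eps=1e-12):
    """Log-domain selection score for one query: scaled logit + log ||v||."""
    scale = 1.0 / math.sqrt(len(query))
    scores = []
    for key, value in zip(keys, values):
        logit = scale * sum(q * k for q, k in zip(query, key))
        value_norm = math.sqrt(sum(x * x for x in value))
        # log(softmax_prob * ||v||) = logit + log ||v|| - logsumexp(logits).
        # The last term is identical for every key, so it cannot change
        # the ranking and is dropped.
        scores.append(logit + math.log(value_norm + eps))
    return scores


def top_k_indices(scores, k):
    """Indices of the k largest scores, returned in position order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


print([round(r, 2) for r in tpd_keep_ratios(num_blocks=4)])  # [0.4, 0.3, 0.2, 0.1]

query = [1.0, 0.0]
keys = [[2.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
values = [[0.1, 0.0], [5.0, 0.0], [1.0, 1.0]]
print(top_k_indices(oam_scores(query, keys, values), k=2))  # [1, 2]
```

In the toy example, key 0 has the highest attention logit but a very small Value vector, so the output‑aware score drops it in favor of keys 1 and 2. Selecting by attention score alone would have kept keys 0 and 1.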
End‑to‑End Acceleration with HPC Operators
Stem is integrated into Tencent Hunyuan’s Hy3 preview (W8A8‑FP8) inference stack and paired with two HPC kernels:
HPC‑Stem: Merges OAM scoring and TPD block selection into a single kernel, eliminating the large intermediate tensors of prior implementations and reducing evaluation cost by roughly 64×.
HPC‑BSA: Designed for Hopper GPUs, this kernel pipelines data movement and computation, natively supports vLLM’s paged KV‑cache and FP8 quantization, and achieves near‑dense‑attention latency with almost zero jump‑block overhead.
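To make the block‑sparse idea concrete, here is a CPU reference sketch of attention for a single query that reads only the selected KV blocks through a page table. It is not the HPC‑BSA kernel: the function name and data layout are hypothetical, and FP8, Hopper pipelining, and batching are all left out. It only shows how a block table maps logical blocks to physical pages, and how an online softmax lets unselected blocks be skipped without a second pass.

```python
import math


def block_sparse_attention(query, kv_pages, block_table, selected_blocks):
    """Attention for one query over selected KV blocks (reference sketch).

    kv_pages:        physical page id -> (keys, values), each a list of vectors
    block_table:     logical block index -> physical page id
    selected_blocks: logical block indices to attend to; all others are skipped
    """
    scale = 1.0 / math.sqrt(len(query))
    running_max = -math.inf  # largest logit seen so far
    denom = 0.0              # softmax denominator, relative to running_max
    acc = None               # weighted sum of values, relative to running_max
    for logical_block in selected_blocks:
        # Page-table indirection: blocks need not be contiguous in memory.
        keys, values = kv_pages[block_table[logical_block]]
        for key, value in zip(keys, values):
            score = scale * sum(q * k for q, k in zip(query, key))
            new_max = max(running_max, score)
            # Rescale earlier partial sums to the new maximum (online softmax).
            rescale = math.exp(running_max - new_max)
            weight = math.exp(score - new_max)
            if acc is None:
                acc = [0.0] * len(value)
            denom = denom * rescale + weight
            acc = [a * rescale + weight * x for a, x in zip(acc, value)]
            running_max = new_max
    if acc is None:
        raise ValueError("no KV entries were selected")
    return [a / denom for a in acc]


# Two logical blocks stored in non-adjacent physical pages 7 and 3.
kv_pages = {
    7: ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]),
    3: ([[1.0, 1.0], [2.0, 0.0]], [[5.0, 6.0], [7.0, 8.0]]),
}
block_table = [7, 3]
print(block_sparse_attention([1.0, 0.5], kv_pages, block_table, [0, 1]))
print(block_sparse_attention([1.0, 0.5], kv_pages, block_table, [1]))
```

When every block is selected, the result equals dense softmax attention. With a subset, it equals dense attention restricted to the selected blocks, which is why the quality of the block selection step matters.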
Benchmark Results
Performance was measured against dense FP8 (HPC‑Dense) and FlashAttention V3 baselines, as well as open‑source sparse kernels MIT‑BSA (BF16) and FlashPrefill‑BSA (BF16). Key findings include:
At 50% sparsity, HPC‑BSA latency is about half of the dense baseline; at 80% sparsity it drops to roughly one‑fifth, with jump‑block overhead below 2.5%.
Across the full sparsity range, HPC‑BSA is ~3× faster than MIT‑BSA, thanks to FP8 throughput and Hopper‑specific optimizations.
The speedup remains stable for sequence lengths from 8K to 256K, demonstrating good long‑sequence scalability.
First‑token latency (TTFT) on a 128K context is reduced by 3.6× compared with dense inference.
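The kernel‑level figures above are close to what a simple proportional model predicts. The sketch below assumes that attention time scales with the fraction of blocks kept and that the jump‑block overhead is a multiplicative factor on that ideal time. Both assumptions, and the function name, are illustrative and do not come from the article.

```python
def sparse_latency_ratio(sparsity, skip_overhead=0.0):
    """Estimated sparse/dense kernel latency under a proportional-cost model."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    if skip_overhead < 0.0:
        raise ValueError("skip_overhead must be non-negative")
    # Only (1 - sparsity) of the blocks are computed; block skipping adds
    # a small relative cost on top of that ideal time.
    return (1.0 - sparsity) * (1.0 + skip_overhead)


for sparsity in (0.5, 0.8):
    ideal = sparse_latency_ratio(sparsity)
    with_overhead = sparse_latency_ratio(sparsity, skip_overhead=0.025)
    print(f"sparsity={sparsity:.0%} ideal={ideal:.3f} with_overhead={with_overhead:.3f}")
```

Under these assumptions, 50% sparsity gives about 0.5 of dense latency and 80% gives about 0.2, matching "half" and "one‑fifth". End‑to‑end TTFT also includes non‑attention work such as the MLP layers, so kernel‑level ratios do not carry over one‑to‑one to the 3.6× figure.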
Conclusion and Outlook
Stem’s algorithmic budget reallocation (TPD) and information‑aware token selection (OAM) achieve near‑lossless accuracy with only 25% of the compute, while the HPC‑Stem and HPC‑BSA kernels translate these theoretical gains into real hardware speedups. As LLM context windows expand toward the million‑token range, such full‑stack optimizations will become essential for efficient long‑text inference.
