Jun 5, 2026 · Artificial Intelligence

Stem Sparse Attention Cuts First-Token Latency by 3.6× for Long-Context LLMs

This article introduces Stem, a sparse-attention algorithm from Tencent Hunyuan that cuts first-token latency by 3.6× for LLMs at 128K context. Stem reallocates compute using Token Position Decay and an Output-Aware Metric, and the gains are validated with HPC-optimized operators that outperform existing sparse-attention methods in extensive benchmarks.

HPC Operators · LLM Inference · Output-Aware Metric

