Tagged articles
Machine Learning Algorithms & Natural Language Processing
Jun 8, 2026 · Artificial Intelligence

DecodeBatch Load Imbalance in LLM Inference: Request Length Differences Amplify

During LLM decoding, the DecodeBatch stage can suffer severe load imbalance: requests with different historical token lengths (kv_len) produce attention tasks of unequal size, so the work is spread unevenly across GPU SMs. The article examines the problem through task granularity, SplitKV heuristics, FlashInfer's batch-size thresholds, and FA3's dynamic scheduling and split strategies.
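To make the imbalance concrete, here is a minimal Python sketch of the effect described above. It is a toy model, not FlashInfer's or FA3's actual scheduler: it assumes the cost of an attention task is proportional to its kv_len, ignores the cost of merging partial results after a split, and uses hypothetical names (`busiest_sm_load`, `split_kv`) and made-up request lengths.

```python
import heapq


def busiest_sm_load(task_costs, num_sms):
    """Assign tasks longest-first to the least-loaded SM; return the max load."""
    loads = [0] * num_sms
    heapq.heapify(loads)
    for cost in sorted(task_costs, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + cost)
    return max(loads)


def split_kv(kv_lens, chunk):
    """Cut each request's KV range into chunks of at most `chunk` tokens."""
    tasks = []
    for kv_len in kv_lens:
        full, rest = divmod(kv_len, chunk)
        tasks.extend([chunk] * full)
        if rest:
            tasks.append(rest)
    return tasks


# Illustrative batch: one long request and seven short ones (assumed values).
kv_lens = [8192, 512, 512, 512, 512, 512, 512, 512]
num_sms = 4

ideal = sum(kv_lens) / num_sms                      # perfectly balanced load
unsplit = busiest_sm_load(kv_lens, num_sms)         # one task per request
split = busiest_sm_load(split_kv(kv_lens, 1024), num_sms)

print(f"ideal={ideal:.0f} unsplit={unsplit} split={split}")
```

In this toy batch the long request alone pins one SM at 8192 tokens of work while the ideal share is 2944; splitting the KV range into 1024-token chunks brings the busiest SM down to 3072. A real kernel would also have to weigh the extra reduction step that a split introduces, which this sketch leaves out.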

DecodeBatch · FA3 · FlashInfer
0 likes · 29 min read