DecodeBatch Load Imbalance in LLM Inference: Request Length Differences Amplify

During LLM decoding, the DecodeBatch stage can suffer severe load imbalance: requests with different historical token lengths (kv_len) produce attention tasks of unequal size, and those tasks end up unevenly distributed across GPU SMs. This article analyzes the problem in terms of task granularity, SplitKV heuristics, FlashInfer’s batch‑size thresholds, and FA3’s dynamic scheduling and split strategies.
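To make the imbalance concrete, here is a minimal sketch of a toy cost model. It is not taken from FlashInfer or FA3. It assumes that the attention cost of one decode step is proportional to kv_len, that each task runs on exactly one SM, and that tasks are assigned greedily to the least-loaded SM. It also ignores the cost of merging partial results after a KV split. The names `makespan`, `split_kv`, and `utilization`, and the example numbers, are hypothetical.

```python
import heapq

def makespan(tasks, num_sms):
    """Assign task costs to SMs greedily (longest first, least-loaded SM) and
    return the load of the busiest SM, which bounds the step latency."""
    loads = [0] * num_sms
    heapq.heapify(loads)
    for cost in sorted(tasks, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + cost)
    return max(loads)

def split_kv(kv_lens, chunk):
    """Cut each request's KV range into pieces of at most `chunk` tokens,
    mimicking a SplitKV-style decomposition (reduction cost is ignored)."""
    tasks = []
    for kv_len in kv_lens:
        full, rest = divmod(kv_len, chunk)
        tasks.extend([chunk] * full)
        if rest:
            tasks.append(rest)
    return tasks

def utilization(tasks, num_sms):
    """Fraction of total SM time spent on useful work under this toy model."""
    return sum(tasks) / (num_sms * makespan(tasks, num_sms))

# Hypothetical batch: one long request and three short ones, on 4 SMs.
kv_lens = [8192, 512, 512, 512]
num_sms = 4

unsplit = list(kv_lens)            # one task per request
split = split_kv(kv_lens, 512)     # 512-token chunks

print("unsplit utilization:", utilization(unsplit, num_sms))
print("split utilization:  ", utilization(split, num_sms))
```

Under these assumptions, the unsplit batch keeps the SMs busy only about 30% of the time, because three SMs sit idle while one processes the 8192-token request. Splitting into 512-token chunks raises that to 95%. A real kernel must also pay for the final reduction across splits, which is one reason split heuristics and batch-size thresholds matter.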

DecodeBatch, FA3, FlashInfer
