Tagged articles
2 articles
Machine Learning Algorithms & Natural Language Processing
Jun 8, 2026 · Artificial Intelligence

DecodeBatch Load Imbalance in LLM Inference: Request Length Differences Amplify

During LLM decoding, the DecodeBatch stage can suffer severe load imbalance: requests with different historical token lengths (kv_len) cause attention work to be distributed unevenly across GPU SMs. The article analyzes the problem in detail, covering task granularity, SplitKV heuristics, FlashInfer’s batch‑size thresholds, and FA3’s dynamic scheduling and split strategies.

DecodeBatch · FA3 · FlashInfer
29 min read
MaGe Linux Operations
Jan 6, 2026 · Artificial Intelligence

How SGLang Boosted LLM Inference on H800 GPUs to 420 Tokens/s

This guide details how switching from vLLM to SGLang on eight NVIDIA H800 GPUs increased Llama‑3‑70B‑Instruct throughput from 180 to 420 tokens per second, covering SGLang’s core innovations, environment setup, configuration tweaks, performance benchmarks, troubleshooting tips, and production‑grade deployment scripts.

FlashInfer · GPU optimization · H800
19 min read