Baidu Intelligent Cloud Tech Hub
Nov 19, 2025 · Artificial Intelligence

Boost LLM Inference Speed with Token‑Level Two‑Chunk Overlap

Token‑level Two‑Chunk Overlap replaces the traditional batch‑level Two‑Batch Overlap scheme. By dynamically splitting sequences into token chunks of near‑equal size, it balances compute and communication times across chunks, improving GPU utilization and delivering up to 30% higher throughput on heterogeneous request workloads, with no loss of accuracy.
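The core idea above — splitting a batch at a token boundary rather than a request boundary — can be sketched as follows. This is a hypothetical illustration, not Baidu's actual implementation: `split_tokens` is an assumed helper that, given per‑request token counts, finds the request index and in‑request offset where the first chunk should end so that both chunks carry a near‑equal number of tokens.

```python
def split_tokens(token_counts):
    """Return (request_index, offset) where chunk A ends.

    token_counts: number of tokens each request contributes to the batch.
    Unlike batch-level splitting, the boundary may fall *inside* a
    request, which is what makes the two chunks balanceable even when
    one request dominates the batch.
    """
    total = sum(token_counts)
    target = total // 2  # tokens assigned to chunk A
    running = 0
    for req, n in enumerate(token_counts):
        if running + n >= target:
            # this request straddles the boundary: its first
            # (target - running) tokens go to chunk A
            return req, target - running
        running += n
    return len(token_counts) - 1, token_counts[-1]
```

For example, with per‑request token counts `[100, 10, 10]` (120 tokens total), a batch‑level split could at best put 100 tokens in one chunk and 20 in the other, while a token‑level split places the boundary 60 tokens into the first request, yielding two 60‑token chunks whose compute and communication phases can overlap evenly.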

Batch scheduling · GPU utilization · LLM inference