Baidu Intelligent Cloud Tech Hub
Nov 19, 2025 · Artificial Intelligence
Boost LLM Inference Speed with Token‑Level Two‑Chunk Overlap
Token‑level Two‑Chunk Overlap replaces traditional batch‑level Two‑Batch Overlap by dynamically splitting a batch's sequences into two token chunks of near‑equal size. With compute and communication times closely matched, the two chunks overlap cleanly, improving GPU utilization and delivering up to 30% higher throughput on heterogeneous request workloads, with zero accuracy loss.
Batch scheduling · GPU utilization · LLM inference
9 min read
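
To make the core idea concrete, here is a minimal sketch of token‑level two‑chunk splitting: the batch's tokens are divided at the global token midpoint, so a long sequence can be cut mid‑way and the two chunks always end up with near‑equal token counts. The function name `split_at_token_midpoint` and the `(seq_idx, start, end)` slice format are illustrative assumptions, not the article's actual scheduler API.

```python
def split_at_token_midpoint(
    seq_lens: list[int],
) -> tuple[list[tuple[int, int, int]], list[tuple[int, int, int]]]:
    """Split a batch's tokens into two chunks at the global token midpoint.

    Unlike batch-level splitting, a sequence may be cut mid-way, so the two
    chunks differ by at most one token regardless of how skewed the batch is.
    Each chunk is a list of (seq_idx, start_token, end_token) slices.
    """
    total = sum(seq_lens)
    target = (total + 1) // 2          # tokens assigned to chunk A
    chunk_a: list[tuple[int, int, int]] = []
    chunk_b: list[tuple[int, int, int]] = []
    filled = 0
    for i, n in enumerate(seq_lens):
        if filled >= target:           # chunk A is full: rest goes to chunk B
            chunk_b.append((i, 0, n))
        elif filled + n <= target:     # whole sequence fits in chunk A
            chunk_a.append((i, 0, n))
            filled += n
        else:                          # cut this sequence at the token level
            cut = target - filled
            chunk_a.append((i, 0, cut))
            chunk_b.append((i, cut, n))
            filled = target
    return chunk_a, chunk_b


if __name__ == "__main__":
    # Heterogeneous batch: one long prompt dominates the token count.
    lens = [2048, 64, 512, 1024]       # 3648 tokens total
    a, b = split_at_token_midpoint(lens)
    print("chunk A:", a)  # [(0, 0, 1824)]                          -> 1824 tokens
    print("chunk B:", b)  # [(0, 1824, 2048), (1, 0, 64), ...]      -> 1824 tokens
```

Because the split point ignores sequence boundaries, the two chunks stay balanced even when request lengths vary wildly, which is exactly the case where batch‑level Two‑Batch Overlap leaves one micro‑batch idle.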
