Boost LLM Inference Speed with Token‑Level Two‑Chunk Overlap
Token‑level Two‑Chunk Overlap replaces traditional batch‑level Two‑Batch Overlap: it dynamically splits sequences into balanced token chunks so that compute and communication times are nearly equal across micro‑batches, improving GPU utilization and delivering up to 30% higher throughput on heterogeneous request workloads, with zero accuracy loss.
In large‑model inference systems, maximizing the overlap between computation and communication is a key design goal. While part of the GPU's resources run forward computation, other resources can simultaneously issue communication operations such as All‑to‑All or AllGather, hiding communication latency and improving hardware utilization.
Traditional Two‑Batch Overlap (TBO)
TBO splits a batch into two micro‑batches at the sequence level and interleaves their layer‑wise execution, assuming the two micro‑batches have comparable compute and communication costs. This works well only when request lengths are homogeneous.
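To see why homogeneity matters, here is a minimal sketch of a sequence‑level split in the spirit of TBO; split_batch_tbo is a hypothetical helper for illustration, not the SGLang implementation. Whole sequences are greedily dealt into two micro‑batches, so a single long sequence can never be rebalanced:

```python
# Hypothetical sequence-level split in the spirit of TBO: whole sequences
# are dealt greedily into two micro-batches; a long sequence cannot be
# divided, so token counts balance only when lengths are homogeneous.
def split_batch_tbo(seq_lens: list[int]) -> tuple[list[int], list[int]]:
    micro0, micro1 = [], []
    for n in sorted(seq_lens, reverse=True):
        # Place each sequence in the currently lighter micro-batch.
        (micro0 if sum(micro0) <= sum(micro1) else micro1).append(n)
    return micro0, micro1

print(split_batch_tbo([512, 512, 512, 512]))  # ([512, 512], [512, 512]) -- balanced
print(split_batch_tbo([2900, 100]))           # ([2900], [100])          -- 29x skew
```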
Problems with TBO in heterogeneous workloads
Real‑world traffic contains highly variable token counts—some requests exceed 3000 tokens while others are only a few dozen. Splitting by whole sequences leads to severe load imbalance: the short micro‑batch finishes quickly while the long one is still communicating, leaving most GPU resources idle. In extreme cases a single long request forces the creation of an empty micro‑batch, further wasting capacity.
Token‑level Two‑Chunk Overlap
Token Two‑Chunk Overlap (called "Token dual‑stream", Token 双流, in the Chinese original) partitions each sequence at the token granularity into chunks, then distributes those chunks across two micro‑batches so that the total token count of each micro‑batch is roughly equal. This dynamic chunking balances compute load and enables effective compute‑communication overlap even when the batch contains only a single long request.
Batch: a collection of user requests processed together.
Sequence: an individual request within a batch; lengths vary widely.
Token: the smallest semantic unit (word or sub‑word) processed by the model.
Micro‑batch: a subset of a batch used for pipeline scheduling.
Chunk: a token‑level slice of a single sequence, used only for scheduling.
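To make the chunking concrete, here is a minimal sketch of the balancing step as I read it from the description above; this is a reconstruction under assumptions, not the SGLang code, and split_batch_token_level is a hypothetical name:

```python
# Reconstruction of token-level two-chunk balancing (not the SGLang code):
# fill micro-batch0 up to half of the total tokens; the sequence that
# crosses the midpoint is cut into two token-level chunks.
def split_batch_token_level(seq_lens: list[int]):
    target = sum(seq_lens) // 2          # tokens micro-batch0 should carry
    micro0, micro1, acc = [], [], 0
    for n in sorted(seq_lens, reverse=True):
        if acc + n <= target:            # whole sequence fits in micro-batch0
            micro0.append(("whole", n))
            acc += n
        elif acc < target:               # cut this sequence at token granularity
            head = target - acc
            micro0.append(("chunk", head))
            micro1.append(("chunk", n - head))
            acc = target
        else:                            # micro-batch0 is full
            micro1.append(("whole", n))
    return micro0, micro1
```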
Illustrative example
Consider two requests: A with 2900 tokens and B with 100 tokens. TBO would place A alone in micro‑batch0 and B alone in micro‑batch1, a 29× workload imbalance.
Token Two‑Chunk Overlap splits request A into two chunks (1500 and 1400 tokens) and combines the second chunk with request B, yielding:
micro‑batch0: 1500 tokens (first half of A)
micro‑batch1: 1500 tokens (second half of A + B)
Both micro‑batches now have comparable compute and communication times, allowing the scheduler to interleave operations and achieve full overlap.
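Running this example through the hypothetical split_batch_token_level sketch above reproduces the same numbers:

```python
# Continues the split_batch_token_level sketch defined earlier.
chunks0, chunks1 = split_batch_token_level([2900, 100])
print(chunks0)  # [('chunk', 1500)]                  -> micro-batch0: 1500 tokens
print(chunks1)  # [('chunk', 1400), ('whole', 100)]  -> micro-batch1: 1500 tokens
```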
Correctness guarantee
The second chunk starts its layer computation only after the corresponding layer of the first chunk finishes, and the KV cache (or latent state) produced by the first chunk is passed as a prefix to the second. This preserves the exact forward‑pass semantics, so inference results are numerically identical to processing the original uninterrupted sequence.
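This prefix‑KV argument is easy to check on a toy model. The sketch below is my own NumPy illustration, not code from the project: single‑head causal attention over the full sequence matches the two‑chunk computation in which the second chunk's queries attend to the first chunk's cached keys and values.

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head causal attention; q may be a suffix of the key sequence."""
    offset = k.shape[0] - q.shape[0]       # global position of q's first row
    scores = q @ k.T / np.sqrt(q.shape[1])
    # Mask keys that lie after each query's global position.
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=offset + 1)
    scores[mask] = -np.inf
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v

rng = np.random.default_rng(0)
T, d, split = 32, 16, 20
q, k, v = (rng.normal(size=(T, d)) for _ in range(3))

full = causal_attention(q, k, v)           # uninterrupted forward pass

out1 = causal_attention(q[:split], k[:split], v[:split])  # first chunk
out2 = causal_attention(q[split:], k, v)   # second chunk sees cached prefix KV
chunked = np.concatenate([out1, out2])

assert np.allclose(full, chunked, atol=1e-6)  # same result as the unsplit pass
```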
Performance evaluation
On a 2×8×H800 cluster using the DeepSeek‑V3‑0324 model, Token Two‑Chunk Overlap delivered:
12.56% higher throughput for a single long request (3072 tokens per DP node).
5.15% higher throughput for mixed‑length requests (30–3072 tokens).
Up to 30% single‑node throughput improvement in Baidu Baige’s production service, while keeping first‑token latency (TTFT) below one second.
The technique works with various attention mechanisms (MLA, GQA, MHA) and has been validated in models such as DeepSeek and Qwen.
Implementation details
The system automatically activates token‑level chunking when the token‑count ratio between two micro‑batches falls outside the [0.9, 1.1] range. The threshold can be tuned via the --tbo-token-distribution-threshold flag. The core implementation was merged into the SGLang open‑source project (GitHub PR #8144).
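As a sketch of that trigger condition (the flag name comes from the source; needs_token_level_split below is a hypothetical helper):

```python
def needs_token_level_split(tokens0: int, tokens1: int,
                            threshold: float = 0.1) -> bool:
    """True when the micro-batch token ratio leaves [1-t, 1+t], i.e. [0.9, 1.1]."""
    ratio = tokens0 / max(tokens1, 1)      # guard against an empty micro-batch
    return not (1 - threshold) <= ratio <= (1 + threshold)

print(needs_token_level_split(1500, 1500))  # False -- already balanced
print(needs_token_level_split(2900, 100))   # True  -- activate token chunking
```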