How Slidebatching Revolutionizes LLM Inference Scheduling for Faster, More Efficient AI Services
The article examines the memory and latency challenges of LLM inference at the hundred-billion-parameter scale, introduces the xLLM framework's Slidebatching and PD-separation scheduling strategies, and details how these techniques achieve up to 35% higher system throughput and a 52% improvement in SLO compliance under real-world multi-priority workloads.
Large language model (LLM) inference faces two conflicting constraints: limited GPU VRAM and the sub-50 ms time-per-output-token (TPOT) latency required for interactive services. JD.com's production environment must serve mixed-load, multi-priority requests on shared clusters while guaranteeing per-request Service Level Objectives (SLOs).
xLLM Inference Framework
The xLLM framework (open‑sourced on GitHub in August 2025) introduces a global multi‑level scheduler combined with multi‑stage pipeline parallelism to address VRAM fragmentation and latency volatility.
Slidebatching Algorithm
Slidebatching continuously evaluates each request’s urgency using three signals:
Current queue length.
Estimated execution time of the request.
Remaining time before the request’s SLO deadline.
Based on the aggregated urgency score, the scheduler operates in one of two modes (a code sketch follows the list):
Low‑load (deadline‑first) mode: All requests are prioritized by earliest deadline, aiming to satisfy every SLO.
High‑load (value‑density‑first) mode: Requests are ranked by benefit‑per‑time‑unit (e.g., revenue or priority weight). The scheduler selects the highest‑density requests first, while still respecting hard deadlines.
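The following is a minimal, self-contained Python sketch of how the two modes might rank and pack a batch. It assumes a per-step time budget and a boolean load signal; the Request fields, pick_batch function, and selection details are illustrative stand-ins, not the xLLM implementation.
from dataclasses import dataclass
import time

@dataclass
class Request:
    req_id: str
    deadline: float        # absolute SLO deadline (seconds since epoch)
    est_exec_time: float   # estimated execution time in seconds
    value: float           # benefit weight, e.g. priority or revenue proxy

def pick_batch(queue, time_budget, high_load):
    """Select requests for the next scheduling step."""
    now = time.time()
    if not high_load:
        # Low-load mode: earliest-deadline-first, try to satisfy every SLO
        ranked = sorted(queue, key=lambda r: r.deadline)
    else:
        # High-load mode: value-density-first, i.e. benefit per unit of execution
        # time, skipping requests whose hard deadline can no longer be met
        feasible = [r for r in queue if now + r.est_exec_time <= r.deadline]
        ranked = sorted(feasible, key=lambda r: r.value / r.est_exec_time,
                        reverse=True)
    batch, used = [], 0.0
    for r in ranked:
        if used + r.est_exec_time <= time_budget:
            batch.append(r)
            used += r.est_exec_time
    return batch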
An asynchronous pipeline hides the scheduling overhead: the CPU “pre‑issues” a placeholder token for the next step while the accelerator processes the current token. When the real token is ready, it replaces the placeholder, allowing the next step to start immediately. This overlap yields a 35% increase in system throughput and a 52% improvement in SLO compliance, with TPOT latency reduced by over 30% in TPOT-sensitive workloads.
PD‑Separation Dual‑Threshold Scheduling
After separating Prefill (P) and Decode (D) stages, the scheduler uses two VRAM‑based thresholds to control instance admission:
Upper threshold: When the number of free VRAM blocks falls below this level, new Prefill requests are blocked to avoid over‑committing memory.
Lower threshold: When free VRAM blocks exceed this level, idle Decode instances are shut down or repurposed, preventing waste.
Decode instances are memory‑intensive because they store KV‑Cache; therefore the scheduler monitors free VRAM block count rather than raw compute load to decide where to place Decode workloads.
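A minimal Python sketch of this dual-threshold check, keeping the article's "upper/lower" naming; the numeric values and function names are placeholders (in practice both thresholds come from offline profiling, described below), not the xLLM implementation.
# Placeholder threshold values; real values are derived from offline profiling
UPPER_THRESHOLD_FREE_BLOCKS = 256    # below this, stop admitting new Prefill requests
LOWER_THRESHOLD_FREE_BLOCKS = 8192   # above this, reclaim idle Decode instances

def admit_prefill(free_vram_blocks):
    # Block new Prefill requests when free VRAM blocks fall below the upper
    # threshold, to avoid over-committing memory
    return free_vram_blocks >= UPPER_THRESHOLD_FREE_BLOCKS

def reclaim_idle_decode(free_vram_blocks, instance_is_idle):
    # Shut down or repurpose an idle Decode instance when free VRAM blocks exceed
    # the lower threshold, to avoid holding underutilized memory
    return instance_is_idle and free_vram_blocks > LOWER_THRESHOLD_FREE_BLOCKS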
Asynchronous Pipeline Optimization
The pipeline removes step‑wise serial dependencies. For each token generation step i:
// Step i: the accelerator computes the real token
accelerator.compute(token_i);
// While the accelerator is busy, the CPU pre-issues a placeholder (fake) token for step i+1
cpu.enqueue(fake_token_{i+1});
// When the accelerator finishes token_i, the placeholder is replaced with the real token
cpu.replace(fake_token_{i+1}, token_i);
// Step i+1 can start without waiting for the replacement
This design lets CPU scheduling and accelerator execution run in parallel, effectively halving pipeline latency.
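To make the overlap concrete, here is a minimal runnable Python sketch (not the xLLM code); run_model_step and schedule_next_batch are hypothetical stand-ins for the accelerator step and the CPU-side scheduler, and the sleeps only simulate their latencies.
import concurrent.futures
import time

PLACEHOLDER = -1  # fake token id used to pre-build the next step's inputs

def run_model_step(token_id):
    """Stand-in for the accelerator computing one decode step."""
    time.sleep(0.01)       # simulated kernel time
    return token_id + 1    # simulated sampled token

def schedule_next_batch(token_id):
    """Stand-in for CPU-side scheduling and batch assembly for the next step."""
    time.sleep(0.005)      # simulated scheduling overhead
    return {"input_token": token_id}

def generate(first_token, steps):
    tokens = [first_token]
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as accelerator:
        for _ in range(steps):
            # Launch step i on the "accelerator" asynchronously
            future = accelerator.submit(run_model_step, tokens[-1])
            # In parallel, the CPU builds step i+1's batch around a placeholder token
            next_batch = schedule_next_batch(PLACEHOLDER)
            # When the real token arrives, it replaces the placeholder
            next_batch["input_token"] = future.result()
            tokens.append(next_batch["input_token"])
    return tokens

print(generate(first_token=0, steps=4))
Because scheduling for step i+1 overlaps with compute for step i, the CPU-side overhead largely disappears from the critical path.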
Hybrid PD Mode
When the time‑to‑first‑token (TTFT) SLO is strict but TPOT tolerance is looser, a hybrid mode allows a limited number of Prefill requests to be processed on Decode instances. This trades a small increase in TPOT for a large reduction in TTFT, benefiting bursty short requests without degrading overall throughput.
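As a minimal illustration (assumptions, not the xLLM API), a router might cap how many Prefill requests are co-located on a Decode instance; ttft_at_risk and MAX_HYBRID_PREFILLS are hypothetical names.
MAX_HYBRID_PREFILLS = 2  # cap on Prefill requests co-located on one Decode instance

def route_prefill(ttft_at_risk, hybrid_prefills_on_decode):
    # Send a Prefill request to a Decode instance only when its TTFT deadline is
    # at risk and the co-location cap has not been reached; otherwise use the
    # normal Prefill pool. This trades a small TPOT increase for a lower TTFT.
    if ttft_at_risk and hybrid_prefills_on_decode < MAX_HYBRID_PREFILLS:
        return "decode_instance"
    return "prefill_instance"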
Threshold Configuration and Monitoring
Both thresholds are derived from offline profiling of typical request patterns (e.g., request length distribution, priority mix). The scheduler also tracks:
Number of free VRAM blocks per instance (proxy for KV‑Cache capacity).
Real‑time TTFT and TPOT metrics per instance.
When no Prefill instance can meet an incoming request's TTFT deadline, the system promotes a Decode instance to serve Prefill (D→P). Conversely, when such a promoted instance's TPOT exceeds its SLO or its KV-Cache nears exhaustion, it reverts to Decode-only (P→D). A lightweight prediction module continuously updates these decisions.
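A minimal Python sketch of this role-switching policy, with illustrative metric names and parameters (InstanceStats, kv_low_watermark, and prefill_pool_saturated are assumptions, not the xLLM implementation).
from dataclasses import dataclass

@dataclass
class InstanceStats:
    role: str              # "prefill" or "decode"
    tpot_p99: float        # observed per-token latency, seconds
    free_kv_blocks: int    # proxy for remaining KV-Cache capacity

def next_role(stats, tpot_slo, kv_low_watermark, prefill_pool_saturated):
    if stats.role == "decode" and prefill_pool_saturated:
        # D -> P: no existing Prefill instance can meet the incoming TTFT deadline
        return "prefill"
    if stats.role == "prefill" and (stats.tpot_p99 > tpot_slo
                                    or stats.free_kv_blocks < kv_low_watermark):
        # P -> D: TPOT SLO violated or KV-Cache nearly exhausted
        return "decode"
    return stats.role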
Future Directions
The roadmap envisions moving from resource-centric to intent-centric scheduling, where the engine predicts workload shifts and proactively reallocates resources. Key research areas include:
State‑aware serverless architectures that preserve KV‑Cache across elastic scaling.
Real‑time model fine‑tuning driven by live traffic patterns.
Automatic selection of parallelism strategies (e.g., DP for Prefill, large‑scale EP for Decode) based on model type (MoE, Mamba) and request characteristics.
Overall, xLLM demonstrates that fine‑grained, latency‑aware scheduling combined with VRAM‑aware resource management can substantially improve LLM inference efficiency in production multi‑tenant environments.