How Slidebatching Revolutionizes LLM Inference Scheduling for Faster, More Efficient AI Services

The article examines the memory and latency challenges of 175‑billion‑parameter‑scale LLM inference, introduces the xLLM framework’s Slidebatching and PD‑separation scheduling strategies, and details how these techniques achieve up to 35% system‑throughput gains and 52% SLO compliance improvements in real‑world multi‑priority workloads.

DataFunSummit

Large language model (LLM) inference faces two conflicting constraints: limited GPU VRAM and the sub‑50 ms time‑per‑output‑token (TPOT) latency required for interactive services. JD.com’s production environment must serve mixed‑load, multi‑priority requests on shared clusters while guaranteeing per‑request Service Level Objectives (SLOs).

xLLM Inference Framework

The xLLM framework (open‑sourced on GitHub in August 2025) introduces a global multi‑level scheduler combined with multi‑stage pipeline parallelism to address VRAM fragmentation and latency volatility.

Slidebatching Algorithm

Slidebatching continuously evaluates each request’s urgency using three signals:

Current queue length.

Estimated execution time of the request.

Remaining time before the request’s SLO deadline.

Based on the aggregated urgency score, the scheduler operates in two modes:

Low‑load (deadline‑first) mode: All requests are prioritized by earliest deadline, aiming to satisfy every SLO.

High‑load (value‑density‑first) mode: Requests are ranked by benefit‑per‑time‑unit (e.g., revenue or priority weight). The scheduler selects the highest‑density requests first, while still respecting hard deadlines.
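The two-mode selection logic above can be sketched as follows. This is an illustrative Python sketch, not xLLM's actual implementation: the `Request` fields, the `high_load_threshold` heuristic for switching modes, and the value-density formula are all assumptions standing in for the three urgency signals the article describes.

```python
from dataclasses import dataclass

@dataclass
class Request:
    id: str
    deadline: float        # absolute SLO deadline (seconds)
    est_exec_time: float   # estimated execution time (seconds)
    value: float           # benefit weight (e.g. priority or revenue)

def pick_next(queue: list, now: float, high_load_threshold: int = 8) -> Request:
    """Choose the next request to schedule.

    Low load  -> earliest-deadline-first: try to satisfy every SLO.
    High load -> value-density-first: rank by benefit per unit of execution
                 time, but still honor hard deadlines for at-risk requests.
    """
    if len(queue) < high_load_threshold:
        # Low-load (deadline-first) mode.
        return min(queue, key=lambda r: r.deadline)
    # High-load mode: requests that would miss their deadline if deferred
    # are treated as urgent regardless of their value density.
    urgent = [r for r in queue if now + r.est_exec_time >= r.deadline]
    if urgent:
        return min(urgent, key=lambda r: r.deadline)
    return max(queue, key=lambda r: r.value / r.est_exec_time)
```

In a real scheduler the mode switch would itself be driven by the aggregated urgency score rather than a fixed queue-length cutoff; the cutoff here only keeps the sketch self-contained.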

An asynchronous pipeline hides the scheduling overhead: the CPU “pre‑issues” a placeholder token for the next step while the accelerator processes the current token. When the real token is ready, it replaces the placeholder, allowing the next step to start immediately. This overlap yields a 35% increase in system throughput and a 52% improvement in SLO compliance, with TPOT latency reduced by over 30% in TPOT‑sensitive workloads.

PD‑Separation Dual‑Threshold Scheduling

After separating Prefill (P) and Decode (D) stages, the scheduler uses two VRAM‑based thresholds to control instance admission:

Upper threshold: When the number of free VRAM blocks falls below this level, new Prefill requests are blocked to avoid over‑committing memory.

Lower threshold: When free VRAM blocks exceed this level, idle Decode instances are shut down or repurposed, preventing waste.

Decode instances are memory‑intensive because they store KV‑Cache; therefore the scheduler monitors free VRAM block count rather than raw compute load to decide where to place Decode workloads.
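A minimal sketch of the dual-threshold admission logic is below. The function and threshold names are illustrative, not xLLM's API; note in particular that the article's "upper" threshold acts as a memory-pressure floor on free blocks (fall below it and Prefill is blocked), while the "lower" threshold acts as a surplus ceiling (rise above it and idle Decode capacity is released), so the sketch names them accordingly.

```python
def admission_action(free_blocks: int,
                     pressure_threshold: int,
                     surplus_threshold: int) -> str:
    """Map the free KV-Cache block count to a scheduling action.

    pressure_threshold (the article's "upper" threshold): when free blocks
    fall below it, new Prefill requests are blocked to avoid over-committing
    memory needed by running Decode requests.
    surplus_threshold (the article's "lower" threshold): when free blocks
    exceed it, idle Decode instances can be shut down or repurposed.
    """
    if free_blocks < pressure_threshold:
        return "block_prefill"
    if free_blocks > surplus_threshold:
        return "release_idle_decode"
    return "steady_state"
```

Monitoring free blocks rather than compute load follows directly from the article's observation that Decode is KV-Cache-bound: an instance with low GPU utilization can still be unable to accept work if its VRAM blocks are exhausted.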

Asynchronous Pipeline Optimization

The pipeline removes step‑wise serial dependencies. For each token generation step i:

// Step i: accelerator computes the real token
accelerator.compute(token_i);
// CPU immediately enqueues a placeholder ("fake") token for step i+1,
// so scheduling and batch assembly for step i+1 start right away
cpu.enqueue(fake_token_{i+1});

// When the accelerator finishes token_i, the placeholder is swapped
// for the real token; step i+1 never had to wait on step i's output
cpu.replace(fake_token_{i+1}, token_i);

This design lets CPU scheduling and accelerator execution run in parallel, effectively halving pipeline latency.
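The overlap can be demonstrated with a runnable Python sketch using a thread pool. The function names (`accelerator_compute`, `cpu_prepare`) and the sleep-based stand-ins are assumptions for illustration only: while the accelerator works on step i, the CPU-side preparation for step i+1 runs concurrently against a placeholder.

```python
import concurrent.futures as cf
import time

def accelerator_compute(step: int) -> str:
    """Stand-in for the accelerator producing the real token for `step`."""
    time.sleep(0.01)  # simulated model forward pass
    return f"token_{step}"

def cpu_prepare(step: int) -> str:
    """Stand-in for CPU-side scheduling and batch assembly for the next
    step, done against a placeholder so it overlaps the accelerator."""
    time.sleep(0.01)
    return f"batch_for_step_{step}"

def run_pipeline(num_steps: int) -> list:
    tokens = []
    with cf.ThreadPoolExecutor(max_workers=2) as pool:
        future = pool.submit(accelerator_compute, 0)
        for step in range(num_steps):
            # Overlap: while the accelerator computes the token for `step`,
            # the CPU already assembles the batch for step+1.
            prep = pool.submit(cpu_prepare, step + 1)
            token = future.result()   # real token replaces the placeholder
            tokens.append(token)
            prep.result()             # CPU work for step+1 is already done
            if step + 1 < num_steps:
                future = pool.submit(accelerator_compute, step + 1)
    return tokens
```

Because both stand-ins take about the same time, each step pays roughly one unit of latency instead of two, matching the "halving" claim above under the assumption that CPU scheduling and accelerator compute are comparable in cost.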

Hybrid PD Mode

When the time‑to‑first‑token (TTFT) SLO is strict but TPOT tolerance is looser, a hybrid mode allows a limited number of Prefill requests to be processed on Decode instances. This trades a small increase in TPOT for a large reduction in TTFT, benefiting bursty short requests without degrading overall throughput.
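The hybrid admission decision can be sketched as a simple gate. All names and parameters here are hypothetical, not xLLM's interface: a Prefill request is only diverted onto a Decode instance when the dedicated Prefill pool is predicted to miss the TTFT SLO and a per-instance hybrid budget has headroom, which bounds the TPOT degradation.

```python
def admit_prefill_on_decode(decode_inflight_prefill: int,
                            max_hybrid_prefill: int,
                            predicted_ttft_ms: float,
                            ttft_slo_ms: float) -> bool:
    """Admit a Prefill request onto a Decode instance only when the
    Prefill pool would miss the TTFT SLO and the hybrid budget allows it."""
    would_miss_ttft = predicted_ttft_ms > ttft_slo_ms
    budget_left = decode_inflight_prefill < max_hybrid_prefill
    return would_miss_ttft and budget_left
```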

Threshold Configuration and Monitoring

Both thresholds are derived from offline profiling of typical request patterns (e.g., request length distribution, priority mix). The scheduler also tracks:

Number of free VRAM blocks per instance (proxy for KV‑Cache capacity).

Real‑time TTFT and TPOT metrics per instance.

When no Prefill instance can meet an incoming request’s TTFT deadline, the system promotes a Decode instance to Prefill (D→P). Conversely, if a promoted instance’s TPOT exceeds its SLO or its KV‑Cache is near exhaustion, it reverts to Decode‑only (P→D). A lightweight prediction module continuously updates these decisions.
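The role-switching rules can be condensed into one transition function. This is an illustrative sketch with assumed names and parameters, not the framework's actual state machine:

```python
def next_role(role: str,
              ttft_deadline_feasible: bool,
              tpot_ms: float, tpot_slo_ms: float,
              free_kv_blocks: int, min_kv_blocks: int) -> str:
    """Decide instance role transitions.

    D -> P: no Prefill instance can meet the incoming request's TTFT deadline.
    P -> D: a promoted instance's TPOT violates its SLO, or its KV-Cache
            is near exhaustion.
    """
    if role == "decode" and not ttft_deadline_feasible:
        return "prefill"
    if role == "prefill" and (tpot_ms > tpot_slo_ms
                              or free_kv_blocks < min_kv_blocks):
        return "decode"
    return role
```

In practice the `ttft_deadline_feasible` flag would come from the lightweight prediction module mentioned above, evaluated against the real-time TTFT/TPOT metrics the scheduler already tracks.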

Future Directions

The roadmap envisions moving from resource‑centric to intent‑centric scheduling, where the engine predicts workload shifts and proactively reallocates resources. Key research areas include:

State‑aware serverless architectures that preserve KV‑Cache across elastic scaling.

Real‑time model fine‑tuning driven by live traffic patterns.

Automatic selection of parallelism strategies (e.g., DP for Prefill, large‑scale EP for Decode) based on model type (MoE, Mamba) and request characteristics.

Overall, xLLM demonstrates that fine‑grained, latency‑aware scheduling combined with VRAM‑aware resource management can substantially improve LLM inference efficiency in production multi‑tenant environments.

Tags: LLM, scheduling, SLO, AI performance, PD separation, Slidebatching
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
