Unified Scheduling Optimization for xLLM in Complex Business Scenarios

This article analyzes how the xLLM open‑source LLM inference engine tackles the coexistence of multiple priority levels and strict SLO latency targets by introducing a dynamic, SLO‑aware batch scheduler and a PD‑separation architecture that improve throughput and SLO satisfaction across diverse workloads.

DataFunSummit

Background

In the xLLM public inference cluster, requests from multiple business lines are mixed. Each request carries two constraints: an SLO latency requirement (time to first token, TTFT, and time per output token, TPOT) and a business priority (VIP requests carry a higher weight because meeting their latency targets yields a larger benefit).

Clusters that are statically partitioned by priority guarantee VIP latency but cannot adapt to dynamic traffic. Trace data shows that the proportion of high‑ and low‑priority traffic varies dramatically throughout the day, leaving some clusters overloaded while others sit idle.

Challenges

Balancing latency and priority for multi‑priority scheduling.

Static policies (FCFS, EDF, SJF) either starve low‑priority requests or fail under bursty load.

KV‑Cache pressure: high load and frequent preemptions waste memory.

The Prefill/Decode (PD) instance ratio is fixed, making it hard to handle varying input‑output lengths.

Solution Overview

The team proposes two complementary techniques:

SLO‑aware adaptive batching (SlideBatching) that dynamically adjusts batch size and admission order based on estimated execution time, deadline proximity, and value density (priority weight / execution time).

PD separation that decouples Prefill and Decode into dedicated instances, adds capacity‑aware routing, and enables dynamic role conversion between Prefill and Decode.
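
Both techniques depend on predicting how long a request will take to execute. The sketch below is one way to build such a predictor, assuming a per-phase linear regression of the kind the article describes for SlideBatching: Prefill latency is fitted against a quadratic feature of the input length, and the value density is then the priority weight divided by the predicted time. The function names, the feature choice, and the synthetic measurements are illustrative assumptions, not xLLM code.

```python
from typing import List, Sequence


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a small dense system a @ x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting keeps the elimination numerically stable.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(features: Sequence[Sequence[float]], targets: Sequence[float]) -> List[float]:
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    k = len(features[0])
    xtx = [[sum(f[i] * f[j] for f in features) for j in range(k)] for i in range(k)]
    xty = [sum(f[i] * t for f, t in zip(features, targets)) for i in range(k)]
    return solve_linear(xtx, xty)


def prefill_features(num_tokens: int) -> List[float]:
    # Hypothetical feature set: the model stays linear in its weights, while
    # the squared term captures the quadratic cost of Prefill.
    # Token counts are scaled to thousands to keep the system well conditioned.
    x = num_tokens / 1000.0
    return [x * x, x, 1.0]


def predict(weights: Sequence[float], feats: Sequence[float]) -> float:
    return sum(w * f for w, f in zip(weights, feats))


def mape(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(p - a) / a for p, a in zip(predicted, actual)) / len(actual)


def value_density(priority_weight: float, est_exec_ms: float) -> float:
    """Priority weight per unit of estimated execution time."""
    return priority_weight / est_exec_ms


# Synthetic latencies generated from a known curve (not real measurements).
lengths = [128, 256, 512, 1024, 2048, 4096]
measured_ms = [3.0 * (n / 1000) ** 2 + 20.0 * (n / 1000) + 5.0 for n in lengths]
weights = fit_ols([prefill_features(n) for n in lengths], measured_ms)
predicted = [predict(weights, prefill_features(n)) for n in lengths]
```

A Decode model could be fitted the same way with a feature vector that has no squared term, reflecting its near‑linear cost.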

SlideBatching Algorithm

The scheduler makes two decisions each round: batch capacity and request ordering. The four steps are:

Update request state. A lightweight latency estimator predicts execution time, which is used to compute each request's remaining time to deadline and its value density. Separate linear‑regression models are trained for Prefill (quadratic complexity) and Decode (near‑linear complexity), achieving a mean absolute percentage error (MAPE) of at most 4.5% across diverse batch combinations.

Determine batch capacity. The capacity is a latency budget set to the shortest remaining time to deadline among queued requests, with a lower bound to avoid over‑fragmentation under heavy load.

Classify requests as urgent or non‑urgent via a load‑diagnosis function; urgent requests are sorted by value density, non‑urgent by deadline, forming a hierarchical admission queue.

Greedily fill the batch in admission order. If the last request does not fit, it is split using Chunked Prefill so that it exactly fills the remaining capacity. The filling process is modeled as a fractional‑knapsack problem in which the latency budget is the knapsack capacity and token‑level execution time is the item size.
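
Steps two to four can be sketched as follows. This is a minimal illustration under stated assumptions, not the xLLM scheduler: the Request fields, the is_urgent rule (a stand-in for the load‑diagnosis function, whose definition the article does not give), the choice to place urgent requests ahead of non‑urgent ones, and the linear per‑token cost used to size the final chunk are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Request:
    rid: str
    priority_weight: float  # business weight, higher for VIP traffic
    est_exec_ms: float      # estimated execution time of the remaining work
    slack_ms: float         # time left until the SLO deadline
    tokens: int             # prefill tokens still to process

    @property
    def value_density(self) -> float:
        return self.priority_weight / self.est_exec_ms


def batch_capacity_ms(queue: List[Request], floor_ms: float) -> float:
    # Step 2: the tightest remaining deadline, bounded below by floor_ms.
    return max(floor_ms, min(r.slack_ms for r in queue))


def is_urgent(r: Request, urgency_ms: float) -> bool:
    # Hypothetical stand-in for the load-diagnosis function: a request is
    # urgent when little time would remain after it runs.
    return r.slack_ms - r.est_exec_ms < urgency_ms


def admission_order(queue: List[Request], urgency_ms: float) -> List[Request]:
    # Step 3: urgent requests by value density (highest first), then the
    # rest by deadline (earliest first). Urgent-first is an assumption.
    urgent = sorted((r for r in queue if is_urgent(r, urgency_ms)),
                    key=lambda r: -r.value_density)
    relaxed = sorted((r for r in queue if not is_urgent(r, urgency_ms)),
                     key=lambda r: r.slack_ms)
    return urgent + relaxed


def fill_batch(ordered: List[Request], capacity_ms: float) -> List[Tuple[str, int]]:
    # Step 4: greedy fractional-knapsack fill. Returns (rid, tokens scheduled).
    batch: List[Tuple[str, int]] = []
    remaining = capacity_ms
    for r in ordered:
        if remaining <= 0:
            break
        if r.est_exec_ms <= remaining:
            batch.append((r.rid, r.tokens))
            remaining -= r.est_exec_ms
        else:
            # Chunked Prefill: schedule only the share of tokens that fits,
            # assuming cost is proportional to token count within a request.
            chunk = int(r.tokens * remaining / r.est_exec_ms)
            if chunk > 0:
                batch.append((r.rid, chunk))
            remaining = 0
    return batch


queue = [
    Request("vip", 4.0, 20.0, 30.0, 2000),
    Request("bulk", 1.0, 40.0, 200.0, 4000),
    Request("chat", 3.0, 10.0, 22.0, 1000),
]
capacity = batch_capacity_ms(queue, floor_ms=25.0)
batch = fill_batch(admission_order(queue, urgency_ms=15.0), capacity)
```

With these made‑up numbers the budget is 25 ms, "chat" is admitted whole, "vip" is split so that 1500 of its 2000 tokens run this round, and "bulk" waits for the next round.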

PD Separation and Capacity‑Aware Routing

Prefill and Decode are split into separate services. After Prefill finishes, the full KV‑Cache is transferred to a Decode instance, eliminating interference between the two phases. A Hierarchical Block Manager asynchronously evicts low‑priority blocks from GPU memory to host memory, while a pipelined loader overlaps computation and data transfer, dynamically adjusting the number of blocks loaded based on forward latency and transfer time.
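
To make the block manager's two behaviors concrete, here is a small sketch. It assumes that a block's transfer time is roughly constant and that the loader should move only as many blocks as can be hidden behind one forward pass; the function names, the max_inflight cap, and the numeric priority scale are hypothetical and not taken from xLLM.

```python
import math
from typing import List, Tuple


def blocks_to_prefetch(forward_ms: float, per_block_transfer_ms: float,
                       pending_blocks: int, max_inflight: int) -> int:
    """Number of host-resident KV blocks to load during one forward pass
    so that the copy time stays hidden behind computation."""
    if per_block_transfer_ms <= 0:
        raise ValueError("per_block_transfer_ms must be positive")
    hidden = math.floor(forward_ms / per_block_transfer_ms)
    return max(0, min(hidden, pending_blocks, max_inflight))


def pick_eviction_victims(blocks: List[Tuple[str, int]], blocks_needed: int) -> List[str]:
    """Choose GPU blocks to move to host memory, lowest priority first.
    Each entry is (block_id, priority); a smaller number means lower priority."""
    ranked = sorted(blocks, key=lambda b: b[1])
    return [block_id for block_id, _ in ranked[:blocks_needed]]


# A long forward pass hides more transfers than a short one.
long_pass = blocks_to_prefetch(12.0, 0.5, pending_blocks=40, max_inflight=16)
short_pass = blocks_to_prefetch(3.0, 0.5, pending_blocks=40, max_inflight=16)
victims = pick_eviction_victims([("a", 3), ("b", 1), ("c", 2)], blocks_needed=2)
```

Recomputing the prefetch count every step from measured forward and transfer times is what lets the amount of loaded data track the current batch.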

Requests are routed using an SLO‑aware, capacity‑aware strategy stored in Etcd. For each incoming request, the system predicts which Prefill instances can meet its TTFT SLO, forming a candidate set C. C is split into a light‑load set L and a heavy‑load set H using thresholds μ and λ. The router selects the most idle instance in L, or the least loaded instance in C when L is empty, or the relatively most loaded instance in H to avoid fragmenting capacity. If no Prefill instance can satisfy TTFT, a low‑load Decode instance is switched to the Prefill role (D→P conversion); conversely, when TPOT becomes too high or the KV‑Cache is exhausted, a Prefill instance is switched to the Decode role (P→D conversion).
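
The routing rule might look like the sketch below. It is an interpretation under assumptions: the PrefillInstance fields are hypothetical, TTFT feasibility is approximated as queued time plus estimated Prefill time, and only the light‑load threshold μ is used. The heavy‑load rule involving λ and the set H is left out because the text does not say when it applies.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PrefillInstance:
    name: str
    load: float       # normalized utilization in [0, 1]
    queued_ms: float  # estimated time to drain work already queued


def route(instances: List[PrefillInstance], est_prefill_ms: float,
          ttft_slo_ms: float, mu: float) -> Optional[str]:
    # Candidate set C: instances predicted to meet the request's TTFT SLO.
    candidates = [i for i in instances
                  if i.queued_ms + est_prefill_ms <= ttft_slo_ms]
    if not candidates:
        # No instance can meet TTFT; the caller may trigger a
        # Decode-to-Prefill role conversion instead.
        return None
    # Light-load set L: candidates at or below the threshold mu.
    light = [i for i in candidates if i.load <= mu]
    pool = light if light else candidates
    return min(pool, key=lambda i: i.load).name


fleet = [
    PrefillInstance("p0", load=0.2, queued_ms=50.0),
    PrefillInstance("p1", load=0.6, queued_ms=10.0),
    PrefillInstance("p2", load=0.1, queued_ms=400.0),
]
choice = route(fleet, est_prefill_ms=100.0, ttft_slo_ms=300.0, mu=0.5)
no_fit = route(fleet, est_prefill_ms=100.0, ttft_slo_ms=50.0, mu=0.5)
```

Here "p2" is the least loaded instance but its queue would break the TTFT target, so it never enters C and "p0" is chosen. With a 50 ms target no instance qualifies and the function returns None.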

Evaluation

On both open‑source and private industrial traces, SlideBatching outperforms FCFS, pure SLO‑aware methods, and priority‑fusion schedulers on two metrics: overall throughput (benefit) and SLO satisfaction rate. High‑priority request latency is comparable to that of strict priority scheduling, while low‑priority latency is significantly better than with SLO‑only methods, eliminating starvation.
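
The two metrics can be computed along these lines. The article does not define benefit precisely, so the sketch assumes it is the sum of priority weights over requests that met their SLO; the function names and the sample results are made up for illustration.

```python
from typing import List, Tuple

# Each result is (priority_weight, met_slo).
Result = Tuple[float, bool]


def slo_satisfaction_rate(results: List[Result]) -> float:
    """Fraction of requests that finished within their SLO."""
    if not results:
        return 0.0
    return sum(1 for _, met in results if met) / len(results)


def benefit(results: List[Result]) -> float:
    """Assumed definition: total priority weight of requests that met their SLO."""
    return sum(weight for weight, met in results if met)


sample = [(4.0, True), (1.0, False), (1.0, True), (2.0, True)]
rate = slo_satisfaction_rate(sample)
total = benefit(sample)
```

Under this definition a scheduler can raise benefit by favoring VIP requests, while the satisfaction rate counts every request equally, which is why the two metrics are reported together.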

Ablation studies show that removing the adaptive module or the latency estimator degrades benefit; replacing asynchronous eviction and pipelined loading with synchronous operations also hurts performance.

PD separation further improves performance: a dynamic P/D ratio beats static presets, and capacity‑aware routing surpasses naive load‑balancing. Combining multi‑priority scheduling with PD separation yields the best overall results.

Future Directions

Dynamic resource partitioning for high‑priority requests to avoid residual impact on low‑priority traffic.

Extending from single‑model to multi‑model clusters for long‑tail workloads.

Supporting different parallel strategies for Prefill and Decode (e.g., MoE‑based Decode benefiting from larger expert parallelism).

Joint decision‑making between PD separation and instance autoscaling.

Partial PD separation where Decode can handle a small amount of Prefill work, further optimizing TTFT under fixed resources.

GitHub repository: https://github.com/jd-opensource/xllm

