How vLLM‑Kunlun Unlocks Peak LLM Performance on Kunlun XPU
This article details the technical challenges of adapting the open‑source vLLM inference framework to Baidu's Kunlun XPU, outlines four major performance bottlenecks, and presents a multi‑dimensional optimization roadmap—including custom plugins, operator fusion, INT8 quantization, and CUDA‑Graph techniques—whose gains range from an ~8% Prefill lift from kernel fusion to a 5.45× INT8 speed‑up, together narrowing the gap with leading GPU hardware.
Background
The vLLM‑Kunlun meetup (15 Mar 2026) examined the challenges of running open‑source large‑language‑model (LLM) inference on the domestically produced Kunlun XPU. The goal was to adapt the vLLM inference engine so that it can fully exploit Kunlun’s hardware units.
Key Pain Points
Kernel launch overhead: frequent kernel invocations cause significant latency.
Framework‑level overhead: native vLLM implementations cannot leverage Kunlun‑specific instructions, wasting compute cycles.
High‑performance operator adaptation: core operators (MatMul, Attention, FFN, MoE, etc.) do not map efficiently to Kunlun accelerators.
Decode throughput: low decode performance limits end‑user responsiveness.
vLLM‑Kunlun Plugin
A dedicated plugin isolates all Kunlun‑specific code from the generic vLLM core, allowing independent development and simplifying operator integration.
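vLLM discovers out‑of‑tree hardware backends through the `vllm.platform_plugins` entry‑point group; a minimal sketch of how such a plugin might register is below. The entry‑point group is vLLM's real mechanism, but the package, module, and class names (`vllm_kunlun`, `KunlunPlatform`, `torch_xmlir`) are invented here for illustration.

```python
# Hypothetical registration hook for an out-of-tree vLLM platform plugin.
# The plugin package would declare it in pyproject.toml:
#
#   [project.entry-points."vllm.platform_plugins"]
#   kunlun = "vllm_kunlun:register"

def register():
    """Called by vLLM at startup. Returns the dotted path of the platform
    class if the Kunlun runtime is importable on this host, else None so
    vLLM silently skips the backend."""
    try:
        import torch_xmlir  # noqa: F401  -- hypothetical Kunlun runtime module
    except ImportError:
        return None
    return "vllm_kunlun.platform.KunlunPlatform"
```

Returning `None` when the runtime is absent keeps the plugin installable everywhere without breaking hosts that lack the XPU stack.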
Operator‑Level Optimizations
Element‑wise operators (e.g., residual addition, activation functions) are mapped to Kunlun’s Cluster vector instructions for bandwidth‑efficient execution.
Transcendental functions (exp, log, sin, cos) are offloaded to the SFU unit, providing low‑latency, high‑precision computation.
MatMul and Attention are accelerated on the SDNN unit, which is optimized for high‑throughput matrix operations.
FFN and PROJ modules use INT8 quantization while preserving model accuracy, reducing memory bandwidth and compute cost.
Fused MoE operator merges memory allocation, scheduling, and CPU overhead into a single kernel.
Attention scheduling separates Prefill and Decode phases, balancing load and maximizing compute utilization.
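The Kunlun kernels themselves are hardware‑specific, but the arithmetic being fused can be shown as a reference. Below is a NumPy sketch of one common element‑wise pair, residual addition followed by RMSNorm, which a single Cluster‑vector kernel would execute in one pass over memory; this is an illustration of the fusion pattern, not the shipped kernel.

```python
import numpy as np

def fused_add_rmsnorm_ref(x, residual, weight, eps=1e-6):
    """Reference math for a fused residual-add + RMSNorm kernel.
    A fused implementation computes both results in one memory pass,
    instead of writing h out and reading it back for the norm."""
    h = x + residual                                          # residual addition
    rms = np.sqrt(np.mean(h * h, axis=-1, keepdims=True) + eps)
    return (h / rms) * weight, h                              # normed output, updated residual
```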
Dynamic MoE Optimization Paths
The product of token count M and top‑k determines the execution path:
SMALL PATH (M·top_k < 400): minimal preprocessing for ultra‑low latency.
MEDIUM PATH (400 ≤ M·top_k ≤ 768): enables sort_mode=True to balance preprocessing cost and compute efficiency.
LARGE PATH (M·top_k > 768): uses block statistics and pre‑sorting; when M ≥ 1024 the MoE kernel fuses the SWISH_GLU activation to cut memory bandwidth.
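The dispatch rule above can be sketched in plain Python. The thresholds (400, 768, 1024) are taken from the article; the function name and returned fields are illustrative, not the plugin's actual API.

```python
def select_moe_path(num_tokens: int, top_k: int):
    """Choose a MoE execution path from the M * top_k workload size,
    following the thresholds described in the article."""
    work = num_tokens * top_k
    if work < 400:
        # SMALL: minimal preprocessing for ultra-low latency.
        return {"path": "SMALL", "sort_mode": False, "fuse_swish_glu": False}
    if work <= 768:
        # MEDIUM: sorting balances preprocessing cost and compute efficiency.
        return {"path": "MEDIUM", "sort_mode": True, "fuse_swish_glu": False}
    # LARGE: block statistics + pre-sorting; fuse SWISH_GLU once M >= 1024.
    return {"path": "LARGE", "sort_mode": True,
            "fuse_swish_glu": num_tokens >= 1024}
```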
Split‑Norm‑RoPE‑Neox Fusion Operator
This fusion kernel combines Q/K normalization, RoPE positional encoding, and gating into a single launch, reducing kernel count from four to one and improving Prefill throughput by roughly 8%.
# Example call (torch.ops.xspeedgate_ops)
output = torch.ops.xspeedgate_ops.split_norm_rope_neox(
    qkv, norm_weight, position_ids, num_heads, head_dim, ...
)
INT8 Quantization Benchmarks
For the mimo‑v2‑flash model (input sequence = 8192, hidden = 1024), INT8 quantization outperforms BF16, achieving up to 5.45× speed‑up as batch size increases.
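As an illustration of the scheme, per‑channel symmetric INT8 weight quantization can be sketched in NumPy. The article does not specify the exact recipe used for FFN/PROJ, so this is a generic reference, not the Kunlun implementation.

```python
import numpy as np

def quantize_int8(w):
    """Per-channel (per-output-row) symmetric INT8 quantization.
    Each row gets its own scale so the full [-127, 127] range is used."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Recover an approximate float weight; error is bounded by scale / 2."""
    return q.astype(np.float32) * scale
```

Storing 1 byte per weight instead of 2 (BF16) halves the memory traffic, which is why the speed‑up grows with batch size as the FFN becomes bandwidth‑bound.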
On the Kunlun P800, the Attention kernel runs in 0.102 ms versus 0.147 ms on an NVIDIA A‑series GPU (≈31 % lower latency). Other modules show comparable or better latency, confirming the hardware advantage after optimization.
CUDA‑Graph Optimizations
Three‑layer improvements eliminate CPU‑side scheduling overhead:
Implementation of Piecewise and Full‑and‑Piecewise CUDA graphs.
Targeted reduction of synchronization costs within the graph capture.
Simultaneous capture of compute and communication streams to overlap resource usage.
Benchmarks show that Full‑and‑Piecewise graphs double output throughput and substantially lower Mean TTFT (time‑to‑first‑token) and Mean TPOT (time‑per‑token) across batch sizes.
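The bucketing idea behind piecewise capture—record the kernel sequence once per padded batch size, then replay it without per‑op CPU dispatch—can be sketched device‑agnostically. Real code would use `torch.cuda.CUDAGraph` capture/replay (or the XPU equivalent); the class and names below are illustrative only.

```python
class GraphRunner:
    """Device-agnostic sketch of graph capture-and-replay bucketing.
    One 'graph' is captured per supported batch size; at runtime the
    batch is padded up to the nearest captured bucket and replayed,
    eliminating per-operator CPU scheduling on the hot path."""

    def __init__(self, capture_sizes, step_fn):
        self.capture_sizes = sorted(capture_sizes)
        # Capture once per bucket (here just a closure; on GPU/XPU this
        # would record the full kernel launch sequence into a graph).
        self.graphs = {n: (lambda n=n: step_fn(n)) for n in self.capture_sizes}

    def run(self, batch_size):
        # Pad the runtime batch up to the nearest captured bucket, replay.
        for n in self.capture_sizes:
            if batch_size <= n:
                return n, self.graphs[n]()
        raise ValueError("batch too large for captured graphs")
```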
Framework‑Level Custom Operators
Replaced native indexer_k_quant_and_cache and flashinfer_rotary_embedding with custom kernels under torch.ops.xspeedgate_ops, exploiting Kunlun’s SIMD pathways.
Introduced get_masked_input_and_mask to replace Python‑based mask handling, leveraging XPU SIMD and model parameters for faster execution.
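The semantics of that op can be sketched in NumPy: map global token ids into the local vocabulary shard owned by one tensor‑parallel rank, and mask everything out of range. This is a reference of the behavior (as in vLLM's vocab‑parallel embedding), not the XPU kernel.

```python
import numpy as np

def get_masked_input_and_mask_ref(input_ids, vocab_start, vocab_end):
    """For a rank owning embedding rows [vocab_start, vocab_end):
    return local row indices for in-range tokens (out-of-range ids are
    clamped to 0) plus the validity mask used to zero their embeddings."""
    mask = (input_ids >= vocab_start) & (input_ids < vocab_end)
    local = np.where(mask, input_ids - vocab_start, 0)
    return local, mask
```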
Random‑Sampling Optimizations
Developed inplace_exponential to generate exponential samples on‑device, avoiding CPU‑GPU synchronization.
Switched the random‑number generator from Philox4x32‑10 to Philox2x32‑10, matching XPU SIMD architecture and improving sampling throughput.
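The exponential draws feed the classic "exponential race" sampler: `argmin(E_i / p_i)` with `E_i ~ Exp(1)` selects index `i` with probability `p_i`, entirely on device and with no CPU‑side synchronization. A NumPy sketch of the math is below; the on‑device in‑place kernel and the Philox counter‑based generator are not reproduced here.

```python
import numpy as np

def exponential_race_sample(probs, rng):
    """Draw one category index with probability proportional to probs,
    using independent Exp(1) variates: argmin(E_i / p_i). On device this
    needs only an in-place exponential fill plus a divide and argmin."""
    e = rng.exponential(size=probs.shape)
    return int(np.argmin(np.where(probs > 0, e / probs, np.inf)))
```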
System‑Level Throughput Enhancements
Large‑scale EP (expert parallelism): distributes MoE experts across many Kunlun devices for massive parallelism.
Dual‑Batch Overlap & DeepEP: overlaps compute and communication to hide latency.
Speculative Decode (MTP): improves the Decode stage, a key factor for overall throughput.
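The verify‑and‑accept control flow of speculative decoding can be shown with a simplified greedy sketch: the draft (MTP) head proposes several tokens, the target model scores them in a single pass, and the longest matching prefix is kept with the first mismatch replaced by the target's token. vLLM's production sampler uses probabilistic rejection sampling rather than this exact greedy rule.

```python
def accept_draft_tokens(draft_tokens, target_tokens):
    """Greedy speculative-decoding verification: accept draft tokens while
    they match the target model's choices; on the first mismatch, emit the
    target's token and stop. Every call yields at least one valid token."""
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)  # first mismatch: take target's token, stop
            break
    return accepted
```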
Overall Results
Comparisons between the Kunlun P800 and NVIDIA H20 show the P800 achieving up to 80 % of the H20's output throughput across most batch sizes, validating the effectiveness of the multi‑dimensional optimization strategy.
Project Reference
Source code and further details are available at https://github.com/baidu/vLLM-Kunlun.
Baidu Intelligent Cloud Tech Hub