How vLLM‑Kunlun Unlocks Peak LLM Performance on Kunlun XPU

This article details the technical challenges of adapting the open-source vLLM inference framework to Baidu's Kunlun XPU, outlines four major performance bottlenecks, and presents a multi-dimensional optimization roadmap, including custom plugins, operator fusion, INT8 quantization, and CUDA-Graph techniques, whose gains range from an ~8% Prefill improvement to a doubling of output throughput, narrowing the gap with leading GPU hardware.

Background

The vLLM‑Kunlun meetup (15 Mar 2026) examined the challenges of running open‑source large‑language‑model (LLM) inference on the domestically produced Kunlun XPU. The goal was to adapt the vLLM inference engine so that it can fully exploit Kunlun’s hardware units.

Key Pain Points

Kernel launch overhead: Frequent kernel invocations cause significant latency.

Framework-level overhead: Native vLLM implementations cannot leverage Kunlun-specific instructions, wasting compute cycles.

High-performance operator adaptation: Core operators (MatMul, Attention, FFN, MoE, etc.) do not map efficiently onto Kunlun's accelerator units.

Decode throughput: Low decode performance limits end-user responsiveness.

vLLM‑Kunlun Plugin

A dedicated plugin isolates all Kunlun‑specific code from the generic vLLM core, allowing independent development and simplifying operator integration.
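
As a sketch of how such isolation typically works: vLLM discovers out-of-tree hardware backends through Python entry points in the vllm.platform_plugins group, where a registration function returns the platform class to load. The module and class names below are hypothetical, not confirmed from the vLLM-Kunlun source:

# Hypothetical entry point for the "vllm.platform_plugins" group;
# vLLM calls it at startup and imports the returned platform class.
def register() -> str:
    return "vllm_kunlun.platform.KunlunPlatform"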

Operator‑Level Optimizations

Element-wise operators (e.g., residual addition, activation functions) are mapped to Kunlun's Cluster vector instructions for bandwidth-efficient execution; a dispatch sketch follows this list.

Transcendental functions (exp, log, sin, cos) are offloaded to the SFU unit, providing low-latency, high-precision computation.

MatMul and Attention are accelerated on the SDNN unit, which is optimized for high‑throughput matrix operations.

FFN and PROJ modules use INT8 quantization while preserving model accuracy, reducing memory bandwidth and compute cost.

A fused MoE operator merges previously separate memory-allocation and scheduling steps into a single kernel, eliminating per-launch CPU overhead.

Attention scheduling separates Prefill and Decode phases, balancing load and maximizing compute utilization.
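
To illustrate the element-wise dispatch pattern referenced above, the sketch below routes a residual-add-plus-RMSNorm through a fused kernel with a pure-PyTorch fallback. The fused_add_rms_norm op under torch.ops.xspeedgate_ops is an assumption for illustration, not a confirmed part of the plugin's API:

import torch

def fused_add_rms_norm(x, residual, weight, eps=1e-6):
    # Hypothetical dispatch: prefer the fused Kunlun Cluster kernel,
    # fall back to an eager PyTorch reference when it is unavailable.
    try:
        return torch.ops.xspeedgate_ops.fused_add_rms_norm(x, residual, weight, eps)
    except (AttributeError, RuntimeError):
        h = x + residual                       # residual addition
        var = h.pow(2).mean(-1, keepdim=True)  # RMSNorm statistics
        return h * torch.rsqrt(var + eps) * weight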

Dynamic MoE Optimization Paths

The product of the token count M and top_k determines the execution path (a dispatch sketch follows this list):

SMALL PATH (M·top_k < 400): minimal preprocessing for ultra‑low latency.

MEDIUM PATH (400 ≤ M·top_k ≤ 768): enables sort_mode=True to balance preprocessing cost and compute efficiency.

LARGE PATH (M·top_k > 768): uses block statistics and pre‑sorting; when M ≥ 1024 the MoE kernel fuses the SWISH_GLU activation to cut memory bandwidth.
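
In code, the path selection reduces to two threshold checks plus the M ≥ 1024 activation-fusion condition. The function name and returned flags below are illustrative, not the actual vLLM-Kunlun interface:

def select_moe_path(m: int, top_k: int):
    # Thresholds taken from the text above; names are hypothetical.
    product = m * top_k
    if product < 400:
        return "SMALL", {"sort_mode": False}   # minimal preprocessing
    if product <= 768:
        return "MEDIUM", {"sort_mode": True}   # balance prep vs. compute
    return "LARGE", {
        "block_stats": True,                   # block statistics + pre-sorting
        "fuse_swish_glu": m >= 1024,           # fuse SWISH_GLU at M >= 1024
    }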

Split‑Norm‑RoPE‑Neox Fusion Operator

This fusion kernel combines Q/K normalization, NeoX-style RoPE positional encoding, and gating into a single launch, reducing the kernel count from four to one and improving Prefill throughput by roughly 8%.

# Example call (torch.ops.xspeedgate_ops); the trailing arguments are
# elided in the source.
output = torch.ops.xspeedgate_ops.split_norm_rope_neox(
    qkv,           # packed QKV projection output
    norm_weight,   # Q/K normalization weights
    position_ids,  # token positions for RoPE
    num_heads, head_dim, ...
)

INT8 Quantization Benchmarks

For the mimo‑v2‑flash model (input sequence = 8192, hidden = 1024), INT8 quantization outperforms BF16, achieving up to 5.45× speed‑up as batch size increases.
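
For readers unfamiliar with the technique, the sketch below shows symmetric per-channel weight quantization to INT8 with a dequantizing reference matmul. This is a generic illustration, not the vLLM-Kunlun implementation; the real kernels execute the matmul natively in INT8 on the SDNN units rather than dequantizing:

import torch

def quantize_int8_per_channel(w: torch.Tensor):
    # Symmetric per-output-channel scales map weights onto [-127, 127].
    scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def int8_linear_reference(x, q_w, scale):
    # Dequantize-on-the-fly reference; a native kernel keeps the
    # accumulation in INT8/INT32 to save bandwidth and compute.
    return x @ (q_w.to(x.dtype) * scale).t()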

On the Kunlun P800, the Attention kernel runs in 0.102 ms versus 0.147 ms on an NVIDIA A-series GPU, roughly 31% lower latency. Other modules show comparable or better latency, confirming the hardware advantage after optimization.

CUDA‑Graph Optimizations

Three layers of improvement eliminate CPU-side scheduling overhead (a capture/replay sketch follows this list):

Implementation of Piecewise and Full‑and‑Piecewise CUDA graphs.

Targeted reduction of synchronization costs within the graph capture.

Simultaneous capture of compute and communication streams to overlap resource usage.
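
For orientation, the basic capture-and-replay pattern in upstream PyTorch is shown below; the Kunlun backend is assumed to expose an analogous graph mechanism through its runtime, and the piecewise variants capture segments of the model rather than the whole forward pass:

import torch

# Minimal capture/replay sketch using PyTorch's stock CUDA Graph API.
model = torch.nn.Linear(1024, 1024).cuda()
static_in = torch.zeros(8, 1024, device="cuda")

# Warm up on a side stream so capture sees steady-state allocations.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_out = model(static_in)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = model(static_in)

# Replay: refill the captured input buffer, then launch the whole
# graph with a single call instead of one launch per kernel.
static_in.copy_(torch.randn(8, 1024, device="cuda"))
g.replay()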

Benchmarks show that Full-and-Piecewise graphs double output throughput and substantially lower mean TTFT (time to first token) and mean TPOT (time per output token) across batch sizes.

Framework‑Level Custom Operators

Replaced native indexer_k_quant_and_cache and flashinfer_rotary_embedding with custom kernels under torch.ops.xspeedgate_ops, exploiting Kunlun’s SIMD pathways.
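
For context, the NeoX-style rotation that flashinfer_rotary_embedding applies computes, in simplified reference form, roughly the following (a sketch of the math being replaced, not the custom kernel; cos and sin are assumed broadcastable to x):

import torch

def rope_neox_reference(x, cos, sin):
    # NeoX-style RoPE rotates the two halves of each head dimension.
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    rotated = torch.cat((-x2, x1), dim=-1)
    return x * cos + rotated * sin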

Introduced a custom get_masked_input_and_mask kernel to replace Python-level mask handling in vocab-parallel embedding lookups, leveraging XPU SIMD for faster execution.
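
Assuming the kernel follows vLLM's vocab-parallel embedding convention, its reference semantics look roughly like this (a sketch, not the kernel itself):

import torch

def get_masked_input_and_mask_reference(input_ids, vocab_start, vocab_end):
    # Tokens owned by this shard are remapped to local row indices;
    # out-of-range tokens are masked to index 0 so their embedding
    # rows can be zeroed after the lookup.
    mask = (input_ids >= vocab_start) & (input_ids < vocab_end)
    local_ids = (input_ids - vocab_start) * mask
    return local_ids, mask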

Random‑Sampling Optimizations

Developed inplace_exponential to generate exponential samples directly on the device, avoiding host-device synchronization.

Switched the random‑number generator from Philox4x32‑10 to Philox2x32‑10, matching XPU SIMD architecture and improving sampling throughput.
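
These exponential samples feed the standard exponential-race trick (equivalent to Gumbel-max) for categorical sampling, sketched below in plain PyTorch; per the description above, inplace_exponential presumably replaces the exponential_() fill with a Kunlun-native kernel:

import torch

def sample_token(probs: torch.Tensor) -> torch.Tensor:
    # Exponential-race sampling: argmax(p_i / E_i) with E_i ~ Exp(1)
    # draws exactly from Categorical(p), entirely on-device.
    q = torch.empty_like(probs).exponential_(1.0)
    return torch.argmax(probs / q, dim=-1)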

System‑Level Throughput Enhancements

Large-scale EP (expert parallelism): Distributes MoE experts across many Kunlun devices for massive parallelism.

Dual-Batch Overlap & DeepEP: Overlaps compute and communication to hide latency.

Speculative Decode (MTP, multi-token prediction): Accelerates the Decode stage, a key factor for overall throughput; a hypothetical engine configuration follows this list.
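
As a hypothetical sketch of how such features are switched on, using upstream vLLM engine arguments (the exact flags and speculative-config keys supported by vLLM-Kunlun may differ):

from vllm import LLM

# Hypothetical configuration; argument names follow upstream vLLM
# conventions and are not confirmed for the Kunlun plugin.
llm = LLM(
    model="path/to/moe-model",          # placeholder model path
    tensor_parallel_size=8,
    enable_expert_parallel=True,        # large-scale EP for MoE layers
    speculative_config={                # MTP-style speculative decode
        "method": "mtp",
        "num_speculative_tokens": 1,
    },
)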

Overall Results

Comparisons between the Kunlun P800 and the NVIDIA H20 show the P800 achieving up to 80% of the H20's output throughput across most batch sizes, validating the effectiveness of the multi-dimensional optimization strategy.

Project Reference

Source code and further details are available at https://github.com/baidu/vLLM-Kunlun.

Tags: vLLM, LLM inference, CUDA Graph, INT8 quantization, Operator fusion, Kunlun XPU
Written by Baidu Intelligent Cloud Tech Hub