How vLLM‑Kunlun Brings CUDA‑Like Inference to Kunlun XPU: Architecture, Adaptation, and Performance Wins
This article details the vLLM‑Kunlun open‑source project that adapts the high‑performance vLLM inference engine to Baidu's Kunlun XPU, covering platform overview, model‑porting workflow, plugin architecture, concrete case studies with MIMO‑Flash‑V2 and Qwen 3.5, and the performance‑tuning techniques that enable seamless, GPU‑level inference on domestic hardware.
Introduction
vLLM‑Kunlun is an open‑source plugin that enables the high‑performance inference framework vLLM to run on Baidu’s Kunlun XPU accelerators with the same developer experience as on CUDA GPUs. The goal is to hide hardware differences so that developers can use Kunlun XPU as if it were a standard CUDA device.
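To make that goal concrete, the sketch below shows what offline inference might look like once vLLM and the plugin are installed; the model name and prompt are placeholders, and the point is that no Kunlun‑specific code appears in the user script.

```python
# Minimal usage sketch, assuming vLLM and the vLLM-Kunlun plugin are installed.
# Because the plugin presents the XPU as a CUDA device, the standard vLLM API
# is used unchanged; the model name below is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain what an inference engine does."], params)
print(outputs[0].outputs[0].text)
```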
Architecture Overview
The plugin sits on top of the upstream vLLM codebase and handles model registration, attention dispatch, and other high‑level logic. Heavy computation is delegated to a Kunlun‑optimized operator library. The stack also includes the Kunlun XPU driver and a toolchain that provides debugging, profiling, and a stable compute foundation.
Key Compatibility Techniques
Programming‑model consistency: The plugin reports itself as a CUDA device, allowing vLLM’s existing CUDA‑specific code paths to be reused without modification.
Interface alignment: Kunlun’s memory‑management APIs mirror torch.cuda calls (e.g., torch.cuda.max_memory_allocated()), so standard PyTorch code automatically maps to Kunlun implementations.
Operator registration: Kunlun‑specific kernels are registered with torch.library as CUDA back‑ends, enabling plug‑and‑play usage and compatibility with features such as Fake Tensor.
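As a rough illustration of the registration approach, the sketch below declares a custom operator with torch.library and binds implementations to the CUDA and Meta dispatch keys; the namespace, operator name, and reference implementation are assumptions, not the plugin's actual code.

```python
# Sketch of registering a custom op for the CUDA backend with torch.library.
# Namespace, op name, and the fallback implementation are illustrative.
import torch
from torch.library import Library

kunlun_lib = Library("kunlun_ops", "DEF")
kunlun_lib.define("fused_rms_norm(Tensor x, Tensor weight, float eps) -> Tensor")

def fused_rms_norm_impl(x, weight, eps):
    # The real plugin would call into a Kunlun kernel here; this is a
    # pure-PyTorch reference so the sketch stays runnable.
    variance = x.pow(2).mean(-1, keepdim=True)
    return x * torch.rsqrt(variance + eps) * weight

def fused_rms_norm_meta(x, weight, eps):
    # Shape-only implementation so Fake Tensor tracing and torch.compile work.
    return torch.empty_like(x)

kunlun_lib.impl("fused_rms_norm", fused_rms_norm_impl, "CUDA")
kunlun_lib.impl("fused_rms_norm", fused_rms_norm_meta, "Meta")

# Callers then use the op like any other registered operator:
# y = torch.ops.kunlun_ops.fused_rms_norm(x, weight, 1e-6)
```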
Standardized Development Flow
Adaptation follows a repeatable process: requirement alignment → interface adaptation → operator enhancement → integration testing → performance tuning. This workflow ensures rapid, stable migration to new vLLM releases.
Case Study 1 – MIMO‑Flash‑V2 Adaptation
Before the vLLM community released official support for the MIMO‑Flash‑V2 attention pattern, the team used the vllm.general_plugins interface to register the model topology and redirected the linear‑layer dependencies to the Kunlun plugin. Within two days the Kunlun team delivered the new operators, packaged in the kunlun_ops wheel; after pip install kunlun_ops, the model ran end‑to‑end on Kunlun XPU.
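A plugin registered through that entry point might look roughly like the sketch below; the architecture name, module path, and class name are hypothetical.

```python
# Sketch of an out-of-tree model registration exposed through the
# vllm.general_plugins entry point; names and paths are hypothetical.
from vllm import ModelRegistry

def register():
    # vLLM calls this hook at startup. The "module:Class" string form defers
    # importing the model class until the architecture is actually requested.
    ModelRegistry.register_model(
        "MimoFlashV2ForCausalLM",
        "vllm_kunlun.models.mimo_flash_v2:MimoFlashV2ForCausalLM",
    )
```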
Case Study 2 – Qwen 3.5 Performance & Accuracy
Upgrading to vLLM‑Kunlun v0.15.1 brought up the full workflow for the Qwen 3.5 model (execution, accuracy debugging, profiling) within two days. Node‑level output matched GPU results at 99.57 %, except for a minor deviation in the LogitsProcessor layer caused by a buggy fully‑connected (FC) operator, which the Kunlun chip team fixed within one day.
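A node‑level comparison of that kind can be as simple as the sketch below, which assumes per‑layer activations from the XPU and GPU runs have already been dumped (e.g. via forward hooks) into dictionaries keyed by module name; the helper is illustrative, not part of the project.

```python
# Illustrative node-level accuracy check: compare per-layer activations from a
# Kunlun run against a GPU reference dumped earlier.
import torch

def compare_activations(xpu_acts: dict, gpu_acts: dict, atol: float = 1e-3) -> None:
    for name, xpu_out in xpu_acts.items():
        gpu_out = gpu_acts[name]
        cos = torch.nn.functional.cosine_similarity(
            xpu_out.flatten().float(), gpu_out.flatten().float(), dim=0
        )
        max_err = (xpu_out.float() - gpu_out.float()).abs().max()
        flag = "OK" if max_err < atol else "CHECK"
        print(f"{name:40s} cos={cos.item():.6f} max_err={max_err.item():.2e} [{flag}]")
```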
Profiling was performed with the built‑in PyTorch Profiler (as described in the vLLM profiling guide); a minimal invocation is sketched below, followed by the two major bottlenecks it revealed.
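The sketch assumes an offline run; the trace directory, model name, and prompt are placeholders.

```python
# Sketch of offline profiling with vLLM's PyTorch Profiler integration.
import os
os.environ["VLLM_TORCH_PROFILER_DIR"] = "/tmp/vllm_profile"  # placeholder path

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # placeholder model
llm.start_profile()
llm.generate(["Profile this prompt."], SamplingParams(max_tokens=64))
llm.stop_profile()
# Open the resulting trace in Perfetto or TensorBoard to look for D2H
# synchronizations and long memory-copy spans like the ones described below.
```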
CPU‑GPU synchronization overhead: The causal_conv1d_fn operator repeatedly called cache_indices.cpu(), triggering costly device‑to‑host (D2H) synchronizations. The fix moved the creation of cache_indices_cpu to the model‑pre‑forward stage and performed the initialization in the GDN attention metadata builder, eliminating the D2H syncs and dramatically improving throughput.
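The pattern behind the fix is sketched below with made‑up names: the blocking .cpu() call is hoisted out of the per‑step hot path into a metadata builder that runs once before the forward pass. This is not the actual vLLM‑Kunlun code, only the shape of the change.

```python
# Before/after sketch of the D2H-sync fix; class and field names are assumptions.
import torch

def conv_step_before(cache_indices: torch.Tensor):
    # Every step pays a blocking device-to-host copy inside the operator.
    return cache_indices.cpu()

class GDNAttentionMetadata:
    def __init__(self, cache_indices: torch.Tensor):
        # Built once before the forward pass, so the copy happens off the hot path.
        self.cache_indices = cache_indices
        self.cache_indices_cpu = cache_indices.cpu()

def conv_step_after(metadata: GDNAttentionMetadata):
    # The hot loop only reads an already-materialized host tensor.
    return metadata.cache_indices_cpu
```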
Excessive memory‑copy time: A boolean‑index assignment (initial_state[~has_initial_state, ...] = 0) in the GDN cache generated a large number of memory accesses and became the core bottleneck. Replacing it with Kunlun‑specific reshape‑and‑cache operators reduced total time‑to‑first‑token (TTFT) by over 20 % for 4K‑token inputs.
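The Kunlun reshape‑and‑cache kernels themselves are not shown here, but an equivalent dense reformulation of the original assignment looks like the sketch below: it zeroes the same rows with a broadcast multiply instead of scattered boolean‑index writes.

```python
# Illustrative alternative to `initial_state[~has_initial_state, ...] = 0`.
# The production fix uses Kunlun-specific reshape-and-cache operators instead.
import torch

def reset_initial_state(initial_state: torch.Tensor,
                        has_initial_state: torch.Tensor) -> torch.Tensor:
    # Broadcast the per-sequence boolean mask over the trailing dims and
    # multiply in place: rows without an initial state become zero, and the
    # access pattern stays contiguous rather than gather/scatter shaped.
    mask = has_initial_state.view(-1, *([1] * (initial_state.dim() - 1)))
    initial_state.mul_(mask.to(initial_state.dtype))
    return initial_state
```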
Performance Results
After the two optimizations, Qwen 3.5 achieved:
99.57 % node‑level alignment with GPU reference.
TTFT improvement >20 % for 4K‑token prompts.
Elimination of D2H synchronization bubbles in the profiling trace.
Future Outlook
The project will track every major vLLM release, extend support to new generations of Kunlun chips, and contribute its optimization and CI/CD experience back to the open‑source community. Expansion of the Model Zoo to cover more frontier models is also planned.
Repository: https://github.com/baidu/vLLM-Kunlun
