Accelerate LLM Deployment on Baidu Kunlun XPU with the Open‑Source vLLM‑Kunlun Plugin
Built on the vLLM hardware‑plugin RFC, the vLLM‑Kunlun Plugin lets developers deploy mainstream large language models on Baidu's Kunlun XPU without modifying vLLM core code. It dramatically shortens migration time, ships high‑performance fused operators, and includes open‑source tools for precision verification and profiling.
Background and Motivation
Whenever the vLLM community adds support for a new model, developers targeting domestic accelerators have typically needed invasive changes to vLLM itself. Such adaptations take three to four weeks and make subsequent engine upgrades difficult.
vLLM‑Kunlun Plugin Overview
The plugin, built on the vLLM hardware‑plugin RFC #11162, decouples the vLLM core from the Kunlun XPU backend. Installing standard vLLM alongside the plugin enables immediate deployment of mainstream LLMs on Kunlun XPU without touching vLLM source code.
Architecture Changes
In the traditional vLLM flow, the Engine schedules requests and Workers create ModelRunner instances that load model classes (e.g., Qwen3MoeForCausalLM) relying on CUDA kernels. With the plugin, the RFC mechanism automatically registers the Kunlun plugin during Engine initialization; Workers then instantiate a Kunlun‑optimized ModelRunner and load a Kunlun‑specific model class (e.g., Qwen3MoeForCausalLM_Kunlun) that calls high‑performance Kunlun operators.
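The substitution described above boils down to the plugin remapping an architecture name to a hardware‑specific model class, so the Worker loads Kunlun kernels instead of CUDA ones. The toy registry below shows the pattern; in real vLLM this is done through `ModelRegistry.register_model(...)`, and the classes here are minimal stand‑ins rather than the actual implementations.

```python
# Toy sketch of the model-class substitution pattern. The registry and
# classes are stand-ins; real vLLM exposes this via ModelRegistry.
MODEL_REGISTRY: dict[str, type] = {}

class Qwen3MoeForCausalLM:            # upstream CUDA-backed class (stand-in)
    backend = "cuda"

class Qwen3MoeForCausalLM_Kunlun:     # plugin's Kunlun-backed class (stand-in)
    backend = "kunlun"

def register_model(arch: str, cls: type) -> None:
    """Later registrations override earlier ones, so a plugin loaded at
    Engine init can shadow the upstream default for the same architecture."""
    MODEL_REGISTRY[arch] = cls

register_model("Qwen3MoeForCausalLM", Qwen3MoeForCausalLM)         # upstream default
register_model("Qwen3MoeForCausalLM", Qwen3MoeForCausalLM_Kunlun)  # plugin override
```

Because the override happens at registration time, the rest of the serving stack is unaware which backend is in use.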
Benefits
Upgrading to a new vLLM engine version only requires aligning the ModelRunner interface in the plugin.
Supporting a new model architecture (e.g., DeepSeek‑V3.2) only needs updating the model‑graph logic inside the plugin while reusing existing high‑performance operators.
Adaptation time shrinks from weeks to days while remaining compatible with upstream releases.
High‑Performance Fusion Operators
Custom operators such as Split_Norm_Rope and Fused MoE have been added to the Kunlun XPU operator library (xtorch_ops) to eliminate bottlenecks in the Attention and MoE modules. Benchmarks on models such as DeepSeek, Qwen, Llama, and GLM show that the Kunlun P800's throughput and latency match those of leading AI accelerators.
Toolchain: torch_xray and xpu_profiler
torch_xray provides layer‑wise output comparison between GPU and P800 runs to locate numerical deviations; xpu_profiler offers nsys‑like profiling with clear operator timelines, helping developers pinpoint performance hotspots.
Model Coverage and Community
More than 20 mainstream and multimodal models (Qwen, DeepSeek, Llama, GLM, InternVL, GPT‑OSS, etc.) are already supported. The plugin is fully open‑source on GitHub together with documentation and the two tool packages.
Getting Started
Resources:
vLLM‑Kunlun Plugin repository: https://github.com/baidu/vLLM-Kunlun?tab=readme-ov-file
torch_xray wheel: https://su.bcebos.com/klx-sdk-release-public/xpytorch/dev_kl3/torch_xray/latest/torch_xray-999.9.9-cp310-cp310-linux_x86_64.whl
xpu_profiler tarball: https://klx-sdk-release-public.su.bcebos.com/v1/xre/xprofiler/release/xprofiler-Linux_x86_64-2.0.2.0.tar.gz
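A typical install‑and‑serve flow might look like the following; consult the repository README for the exact, current commands, as the package layout and model name below are illustrative assumptions.

```shell
# Illustrative setup sketch -- exact commands live in the repo README.
pip install vllm                                   # standard upstream vLLM
git clone https://github.com/baidu/vLLM-Kunlun
pip install -e ./vLLM-Kunlun                       # out-of-tree Kunlun backend
vllm serve Qwen/Qwen3-30B-A3B                      # plugin auto-registers on XPU hosts
```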
Community Involvement
Developers can follow the GitHub issue tracker and join the Slack channel (https://vllm-kunlun.slack.com) for technical support and to contribute upstream.
Baidu Intelligent Cloud Tech Hub
We share the cloud tech topics you care about. Feel free to leave a message and tell us what you'd like to learn.
