Accelerate LLM Deployment on Baidu Kunlun XPU with the Open‑Source vLLM‑Kunlun Plugin

The vLLM‑Kunlun Plugin, built on the vLLM hardware‑plugin RFC, lets developers deploy any major large language model on Baidu's Kunlun XPU without modifying vLLM core code. It dramatically shortens migration time, ships high‑performance fusion operators, and provides open‑source tools for precision verification and profiling.

Baidu Intelligent Cloud Tech Hub

Background and Motivation

When the vLLM community adds support for a new model, developers targeting domestic chips typically have to make invasive changes to vLLM's core code. Each adaptation takes three to four weeks, and the resulting fork is difficult to keep in sync with upstream upgrades.

vLLM‑Kunlun Plugin Overview

The plugin, built on the vLLM hardware‑plugin RFC #11162, decouples the vLLM core from the Kunlun XPU backend. Installing the standard vLLM and the plugin enables immediate deployment of any major LLM on Kunlun XPU without modifying vLLM source.
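A minimal usage sketch, assuming both packages are installed from the resources listed at the end of the article: the standard vLLM Python API is used unchanged, and the Kunlun backend is picked up automatically when the engine starts. The model name below is illustrative.

```python
from vllm import LLM, SamplingParams

# Standard vLLM API; no Kunlun-specific code is required.
# The plugin registers the XPU backend automatically at engine initialization.
llm = LLM(model="Qwen/Qwen3-30B-A3B")  # illustrative model name

outputs = llm.generate(
    ["Explain what a hardware plugin does in vLLM."],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```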

Architecture Changes

In the traditional vLLM flow, the Engine schedules requests and Workers create ModelRunner instances that load a model class (e.g., Qwen3MoeForCausalLM) relying on CUDA kernels. With the plugin, the Engine automatically registers the Kunlun backend through the RFC's plugin mechanism during initialization; Workers then instantiate a Kunlun‑optimized ModelRunner and load a Kunlun‑specific model class (e.g., Qwen3MoeForCausalLM_Kunlun) that calls high‑performance Kunlun operators.
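The registration step can be sketched as follows. vLLM's hardware‑plugin mechanism discovers plugins through the vllm.platform_plugins entry‑point group; the package, module, and class names shown here (vllm_kunlun, KunlunPlatform) are assumptions for illustration, not taken from the actual repository.

```python
# setup.py of a hardware-plugin package (illustrative names).
from setuptools import setup

setup(
    name="vllm-kunlun",
    packages=["vllm_kunlun"],
    entry_points={
        "vllm.platform_plugins": [
            # vLLM scans this entry-point group during Engine initialization
            # and calls register() on every plugin it finds.
            "kunlun = vllm_kunlun:register",
        ]
    },
)

# vllm_kunlun/__init__.py (illustrative)
def register() -> str:
    # Returning the fully qualified Platform class tells vLLM to route
    # Worker and ModelRunner creation to the Kunlun-specific backend.
    return "vllm_kunlun.platform.KunlunPlatform"
```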

Benefits

Upgrading to a new vLLM engine version only requires aligning the ModelRunner interface in the plugin.

Supporting a new model architecture (e.g., DeepSeek‑V3.2) only requires updating the model‑graph logic inside the plugin while reusing the existing high‑performance operators.

Adaptation time shrinks from weeks to days, and the plugin stays compatible with upstream releases.
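As an illustration of the model‑support point above, vLLM exposes ModelRegistry for exactly this purpose: a plugin can map an architecture name to its own implementation without touching vLLM core. The Kunlun‑side module path and class name below are assumptions for illustration.

```python
from vllm import ModelRegistry

def register_models() -> None:
    # Map the upstream architecture name to the plugin's implementation,
    # which calls Kunlun fusion operators instead of CUDA kernels.
    # The "module:Class" string is resolved lazily, so the model code is
    # only imported when this architecture is actually requested.
    ModelRegistry.register_model(
        "Qwen3MoeForCausalLM",
        "vllm_kunlun.models.qwen3_moe:Qwen3MoeForCausalLM_Kunlun",
    )
```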

High‑Performance Fusion Operators

Custom operators such as Split_Norm_Rope and Fused MoE have been added to the Kunlun XPU operator library (xtorch_ops) to eliminate bottlenecks in the Attention and MoE modules. Benchmarks on models such as DeepSeek, Qwen, Llama, and GLM show that the P800's throughput and latency match those of leading AI accelerators.
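To make the motivation concrete, here is the unfused computation that a Split_Norm_Rope‑style kernel collapses into a single launch: split the projected QKV tensor, RMS‑normalize Q and K, then apply rotary embeddings. This is a plain‑PyTorch sketch of the pattern, not the xtorch_ops API; every step on this reference path is a separate, memory‑bound kernel.

```python
import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    variance = x.pow(2).mean(dim=-1, keepdim=True)
    return x * torch.rsqrt(variance + eps) * weight

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # x: [batch, seq, heads, head_dim]; rotate the two halves of each head.
    x1, x2 = x.chunk(2, dim=-1)
    rotated = torch.cat((-x2, x1), dim=-1)
    return x * cos + rotated * sin

def split_norm_rope_reference(qkv, q_weight, k_weight, cos, sin,
                              num_heads, num_kv_heads, head_dim):
    # Unfused path: the split, two norms, and two RoPE applications each
    # re-read the activations from memory. A fused operator performs the
    # whole sequence in one pass over the data.
    q, k, v = qkv.split(
        [num_heads * head_dim, num_kv_heads * head_dim, num_kv_heads * head_dim],
        dim=-1,
    )
    b, s = q.shape[:2]
    q = rms_norm(q.view(b, s, num_heads, head_dim), q_weight)
    k = rms_norm(k.view(b, s, num_kv_heads, head_dim), k_weight)
    q = apply_rope(q, cos, sin)
    k = apply_rope(k, cos, sin)
    return q, k, v.view(b, s, num_kv_heads, head_dim)
```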

Toolchain: torch_xray and xpu_profiler

torch_xray provides layer‑wise output comparison between GPU and P800 runs to locate numerical deviations; xpu_profiler offers nsys‑like profiling with clear operator timelines, helping developers pinpoint performance hotspots.
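The idea behind layer‑wise comparison can be sketched with plain PyTorch forward hooks: capture every leaf module's output on a reference run and on a device run, then diff them layer by layer to find where the numerics first diverge. This is not the torch_xray API, only an illustration of what the tool automates.

```python
import torch

def capture_layer_outputs(model: torch.nn.Module, inputs: torch.Tensor) -> dict:
    """Run the model once and record every leaf module's output on CPU."""
    captured, hooks = {}, []

    def make_hook(name):
        def hook(_module, _inputs, output):
            if isinstance(output, tuple):
                output = output[0]
            captured[name] = output.detach().float().cpu()
        return hook

    for name, module in model.named_modules():
        if not list(module.children()):  # leaf modules only
            hooks.append(module.register_forward_hook(make_hook(name)))
    with torch.no_grad():
        model(inputs)
    for h in hooks:
        h.remove()
    return captured

def report_deviations(reference: dict, candidate: dict, atol: float = 1e-3) -> None:
    """Print the max absolute error per layer and flag layers above tolerance."""
    for name, ref_out in reference.items():
        if name in candidate:
            max_err = (ref_out - candidate[name]).abs().max().item()
            flag = "DIFF" if max_err > atol else "ok  "
            print(f"{flag} {name}: max abs error {max_err:.3e}")
```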

Model Coverage and Community

More than 20 mainstream and multimodal models (Qwen, DeepSeek, Llama, GLM, InternVL, GPT‑OSS, etc.) are already supported. The plugin is fully open‑source on GitHub together with documentation and the two tool packages.

Getting Started

Resources:

vLLM‑Kunlun Plugin repository: https://github.com/baidu/vLLM-Kunlun?tab=readme-ov-file

torch_xray wheel: https://su.bcebos.com/klx-sdk-release-public/xpytorch/dev_kl3/torch_xray/latest/torch_xray-999.9.9-cp310-cp310-linux_x86_64.whl

xpu_profiler tarball: https://klx-sdk-release-public.su.bcebos.com/v1/xre/xprofiler/release/xprofiler-Linux_x86_64-2.0.2.0.tar.gz

Community Involvement

Developers can follow the GitHub issue tracker and join the Slack channel (https://vllm-kunlun.slack.com) for technical support and to contribute upstream.

Tags: LLM, vLLM, Open-source, Inference, XPU, Kunlun