Accelerate LLM Deployment on Baidu Kunlun XPU with the Open‑Source vLLM‑Kunlun Plugin
The vLLM‑Kunlun Plugin, jointly released by Baidu Baige and Kunlun Chip, provides a high‑performance, zero‑intrusion solution for deploying open‑source large language models on domestic Kunlun XPU hardware. It includes fused operators, precision‑validation and profiling tools, and support for more than twenty mainstream and multimodal models.
Background
Deploying open‑source large language models (LLMs) on domestic chips has traditionally required invasive code changes and weeks of engineering effort, creating bottlenecks in efficiency and performance.
Plugin Overview
To address this, Baidu Baige and Kunlun Chip have open‑sourced the vLLM‑Kunlun Plugin, a hardware plugin that conforms to the vLLM community's RFC #11162 standard and decouples the vLLM core from the Kunlun XPU backend.
Zero‑Intrusion Integration
Developers only need a standard vLLM installation and the plugin; no modifications to vLLM core code are required. The plugin automatically registers during engine initialization, creates a Kunlun‑optimized ModelRunner, and loads custom model classes (e.g., Qwen3MoeForCausalLM_Kunlun) that invoke high‑performance Kunlun operators.
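As a sketch of what zero‑intrusion means in practice, the snippet below drives a model through vLLM's standard Python API. It assumes the plugin ships as a pip‑installable package (the `vllm-kunlun` package name and the model ID are illustrative, not confirmed by the release); once installed, vLLM discovers the backend through its plugin entry points and no vLLM code is touched:

```python
# A minimal sketch, assuming the plugin is published as "vllm-kunlun"
# (package name and model ID below are assumptions for illustration):
#
#   pip install vllm vllm-kunlun
#
from vllm import LLM, SamplingParams

# Standard vLLM API: the plugin registers the Kunlun XPU platform during
# engine initialization and substitutes its optimized ModelRunner.
llm = LLM(model="Qwen/Qwen3-30B-A3B")
params = SamplingParams(temperature=0.7, max_tokens=128)

for out in llm.generate(["Explain how vLLM hardware plugins work."], params):
    print(out.outputs[0].text)
```

The application code is identical to what would run on any other vLLM backend, which is the point of the plugin design.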
Performance‑Focused Operator Fusion
Specialized fused operators such as Split_Norm_Rope and Fused MoE have been added to the Kunlun operator library (e.g., xtorch_ops), eliminating bottlenecks in attention and MoE modules and matching the throughput and latency of mainstream AI accelerators across models like DeepSeek, Qwen, Llama, and GLM.
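To make concrete what such a fusion buys, the sketch below spells out the three separate, memory‑bound steps that a Split_Norm_Rope‑style kernel can collapse into a single pass over the QKV projection. This is a plain‑PyTorch reference pattern, not the xtorch_ops interface:

```python
import torch

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm over the last dimension.
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * weight

def apply_rope(x, cos, sin):
    # Rotary position embedding, half-rotation form; x is [tokens, heads, head_dim].
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((x1 * cos - x2 * sin, x2 * cos + x1 * sin), dim=-1)

def split_norm_rope_unfused(qkv, q_w, k_w, cos, sin,
                            n_q_heads, n_kv_heads, head_dim):
    """Three memory-bound steps a fused kernel can collapse into one pass."""
    t = qkv.shape[0]
    q_dim, kv_dim = n_q_heads * head_dim, n_kv_heads * head_dim
    q, k, v = qkv.split([q_dim, kv_dim, kv_dim], dim=-1)   # 1) split QKV
    q = q.view(t, n_q_heads, head_dim)
    k = k.view(t, n_kv_heads, head_dim)
    q, k = rms_norm(q, q_w), rms_norm(k, k_w)               # 2) per-head QK-norm
    cos, sin = cos[:, None, :], sin[:, None, :]             # broadcast over heads
    q, k = apply_rope(q, cos, sin), apply_rope(k, cos, sin) # 3) RoPE
    return q, k, v.view(t, n_kv_heads, head_dim)
```

Each of these steps reads and writes the full activation tensor; fusing them into one kernel removes the intermediate round trips to device memory, which is why this pattern shows up as an attention‑path bottleneck in the first place.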
Toolchain: torch_xray and xpu_profiler
The release also includes two internally validated tools: torch_xray, for layer‑wise precision comparison between GPU and Kunlun P800, and xpu_profiler, an nsys‑style profiler that generates clear operator call timelines to pinpoint performance hotspots.
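The layer‑wise comparison idea behind torch_xray can be illustrated with ordinary PyTorch forward hooks. The sketch below is not torch_xray's API, only the underlying technique: capture per‑layer outputs on a GPU reference run and on a Kunlun run with the same inputs, then rank layers by error to localize where precision first diverges:

```python
import torch

def capture_layer_outputs(model, inputs):
    """Record each leaf module's output during one forward pass."""
    records, hooks = {}, []

    def make_hook(name):
        def hook(module, args, output):
            out = output[0] if isinstance(output, tuple) else output
            if torch.is_tensor(out):
                records[name] = out.detach().float().cpu()
        return hook

    for name, module in model.named_modules():
        if not list(module.children()):  # leaf modules only
            hooks.append(module.register_forward_hook(make_hook(name)))
    with torch.no_grad():
        model(**inputs)
    for h in hooks:
        h.remove()
    return records

def report_divergence(ref, test, atol=1e-3):
    """Print max absolute error per layer against the reference run."""
    for name, ref_out in ref.items():
        if name in test and ref_out.shape == test[name].shape:
            err = (ref_out - test[name]).abs().max().item()
            print(f"{'DIFF' if err > atol else 'OK  '} {name}  max_abs_err={err:.3e}")
```

Running `capture_layer_outputs` once per device and feeding both dictionaries to `report_divergence` turns a whole‑model numerical mismatch into a per‑layer diff, which is the workflow the real tool automates.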
Model Coverage
To date, the plugin supports more than 20 mainstream and multimodal model families—including Qwen, DeepSeek‑V3.2, Llama, GLM, InternVL, and GPT‑OSS—allowing both open‑source and private models to be deployed and optimized on Kunlun P800 with minimal effort.
Open Collaboration
The full source code, documentation, and toolchains are available on GitHub (https://github.com/baidu/vLLM-Kunlun). Community contributions are welcomed via GitHub Issues and the official Slack workspace (https://vllm-kunlun.slack.com/), enabling direct upstream integration of feature requests and bug fixes.