Accelerate LLM Deployment on Baidu Kunlun XPU with the Open‑Source vLLM‑Kunlun Plugin
The vLLM‑Kunlun Plugin, jointly released by Baidu Baige and Kunlun Chip, provides a high‑performance, zero‑intrusion solution for deploying open‑source large language models on domestic Kunlun XPU hardware. It includes fused operators, precision‑validation and profiling tools, and support for more than twenty mainstream and multimodal models.
Background
Deploying open‑source large language models (LLMs) on domestic chips has traditionally required invasive code changes and weeks of engineering effort, creating bottlenecks in efficiency and performance.
Plugin Overview
To address this, Baidu Baige and Kunlun Chip have open‑sourced the vLLM‑Kunlun Plugin, a hardware plugin that conforms to the vLLM community's RFC #11162 standard and decouples the vLLM core from the Kunlun XPU backend.
Zero‑Intrusion Integration
Developers only need a standard vLLM installation and the plugin; no modifications to vLLM core code are required. The plugin automatically registers during engine initialization, creates a Kunlun‑optimized ModelRunner, and loads custom model classes (e.g., Qwen3MoeForCausalLM_Kunlun) that invoke high‑performance Kunlun operators.
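As a sketch of what zero‑intrusion means in practice, the snippet below drives a model through vLLM's standard Python API. It assumes the plugin ships as a pip‑installable package (the `vllm-kunlun` package name and the model ID are illustrative, not confirmed by the release); once installed, vLLM discovers the backend through its plugin entry points and no vLLM code is touched:

```python
# A minimal sketch, assuming the plugin is published as "vllm-kunlun"
# (package name and model ID below are assumptions for illustration):
#
#   pip install vllm vllm-kunlun
#
from vllm import LLM, SamplingParams

# Standard vLLM API: the plugin registers the Kunlun XPU platform during
# engine initialization and substitutes its optimized ModelRunner.
llm = LLM(model="Qwen/Qwen3-30B-A3B")
params = SamplingParams(temperature=0.7, max_tokens=128)

for out in llm.generate(["Explain how vLLM hardware plugins work."], params):
    print(out.outputs[0].text)
```

The application code is identical to what would run on any other vLLM backend, which is the point of the plugin design.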
Performance‑Focused Operator Fusion
Specialized fused operators such as Split_Norm_Rope and Fused MoE have been added to the Kunlun operator library (e.g., xtorch_ops), eliminating bottlenecks in attention and MoE modules and matching the throughput and latency of mainstream AI accelerators across models like DeepSeek, Qwen, Llama, and GLM.
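To make concrete what such a fusion buys, the sketch below spells out the three separate, memory‑bound steps that a Split_Norm_Rope‑style kernel can collapse into a single pass over the QKV projection. This is a plain‑PyTorch reference pattern, not the xtorch_ops interface:

```python
import torch

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm over the last dimension.
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * weight

def apply_rope(x, cos, sin):
    # Rotary position embedding, half-rotation form; x is [tokens, heads, head_dim].
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((x1 * cos - x2 * sin, x2 * cos + x1 * sin), dim=-1)

def split_norm_rope_unfused(qkv, q_w, k_w, cos, sin,
                            n_q_heads, n_kv_heads, head_dim):
    """Three memory-bound steps a fused kernel can collapse into one pass."""
    t = qkv.shape[0]
    q_dim, kv_dim = n_q_heads * head_dim, n_kv_heads * head_dim
    q, k, v = qkv.split([q_dim, kv_dim, kv_dim], dim=-1)   # 1) split QKV
    q = q.view(t, n_q_heads, head_dim)
    k = k.view(t, n_kv_heads, head_dim)
    q, k = rms_norm(q, q_w), rms_norm(k, k_w)               # 2) per-head QK-norm
    cos, sin = cos[:, None, :], sin[:, None, :]             # broadcast over heads
    q, k = apply_rope(q, cos, sin), apply_rope(k, cos, sin) # 3) RoPE
    return q, k, v.view(t, n_kv_heads, head_dim)
```

Each of these steps reads and writes the full activation tensor; fusing them into one kernel removes the intermediate round trips to device memory, which is why this pattern shows up as an attention‑path bottleneck in the first place.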
Toolchain: torch_xray and xpu_profiler
The release also includes two internally validated tools: torch_xray, for layer‑wise precision comparison between GPU and Kunlun P800, and xpu_profiler, an nsys‑style profiler that generates clear operator call timelines to pinpoint performance hotspots.
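The layer‑wise comparison idea behind torch_xray can be illustrated with ordinary PyTorch forward hooks. The sketch below is not torch_xray's API, only the underlying technique: capture per‑layer outputs on a GPU reference run and on a Kunlun run with the same inputs, then rank layers by error to localize where precision first diverges:

```python
import torch

def capture_layer_outputs(model, inputs):
    """Record each leaf module's output during one forward pass."""
    records, hooks = {}, []

    def make_hook(name):
        def hook(module, args, output):
            out = output[0] if isinstance(output, tuple) else output
            if torch.is_tensor(out):
                records[name] = out.detach().float().cpu()
        return hook

    for name, module in model.named_modules():
        if not list(module.children()):  # leaf modules only
            hooks.append(module.register_forward_hook(make_hook(name)))
    with torch.no_grad():
        model(**inputs)
    for h in hooks:
        h.remove()
    return records

def report_divergence(ref, test, atol=1e-3):
    """Print max absolute error per layer against the reference run."""
    for name, ref_out in ref.items():
        if name in test and ref_out.shape == test[name].shape:
            err = (ref_out - test[name]).abs().max().item()
            print(f"{'DIFF' if err > atol else 'OK  '} {name}  max_abs_err={err:.3e}")
```

Running `capture_layer_outputs` once per device and feeding both dictionaries to `report_divergence` turns a whole‑model numerical mismatch into a per‑layer diff, which is the workflow the real tool automates.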
Model Coverage
To date, the plugin supports more than 20 mainstream and multimodal model families—including Qwen, DeepSeek‑V3.2, Llama, GLM, InternVL, and GPT‑OSS—allowing both open‑source and private models to be deployed and optimized on Kunlun P800 with minimal effort.
Open Collaboration
The full source code, documentation, and toolchains are available on GitHub (https://github.com/baidu/vLLM-Kunlun). Community contributions are welcomed via GitHub Issues and the official Slack workspace (https://vllm-kunlun.slack.com/), enabling direct upstream integration of feature requests and bug fixes.