How vLLM‑Kunlun Brings CUDA‑Like Inference to Kunlun XPU: Architecture, Adaptation, and Performance Wins

This article details vLLM‑Kunlun, an open‑source project that adapts the high‑performance vLLM inference engine to Baidu's Kunlun XPU. It covers the platform, the model‑porting workflow, the plugin architecture, concrete case studies of MIMO‑Flash‑V2 and Qwen 3.5, and the performance‑tuning techniques that enable seamless, GPU‑level inference on domestic hardware.

Baidu Intelligent Cloud Tech Hub

Introduction

vLLM‑Kunlun is an open‑source plugin that enables the high‑performance inference framework vLLM to run on Baidu’s Kunlun XPU accelerators with the same developer experience as on CUDA GPUs. The goal is to hide hardware differences so that developers can use Kunlun XPU as if it were a standard CUDA device.

Architecture Overview

The plugin sits on top of the upstream vLLM codebase and handles model registration, attention dispatch, and other high‑level logic. Heavy computation is delegated to a Kunlun‑optimized operator library. The stack also includes the Kunlun XPU driver and a toolchain that provides debugging, profiling, and a stable compute foundation.

Key Compatibility Techniques

Programming‑model consistency: The plugin reports itself as a CUDA device, allowing vLLM’s existing CUDA‑specific code paths to be reused without modification.

Interface alignment: Kunlun’s memory‑management APIs mirror torch.cuda calls (e.g., torch.cuda.max_memory_allocated()), so standard PyTorch code automatically maps to Kunlun implementations.
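As a minimal illustration, these are the kinds of unmodified torch.cuda call sites the plugin intercepts (assuming the plugin is installed and advertises itself as a CUDA device; on a machine without an accelerator the sketch simply reports zero):

```python
import torch

# Unmodified PyTorch code: with the vLLM-Kunlun plugin loaded, these
# torch.cuda calls would route to Kunlun implementations. On a plain
# CPU machine no device is visible, so we fall back to 0.
def peak_device_memory() -> int:
    if not torch.cuda.is_available():
        return 0                                 # no accelerator visible
    return torch.cuda.max_memory_allocated()     # peak bytes on device 0

print(peak_device_memory())
```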

Operator registration: Kunlun‑specific kernels are registered with torch.library as CUDA back‑ends, enabling plug‑and‑play usage and compatibility with features such as Fake Tensor.

Standardized Development Flow

Adaptation follows a repeatable process: requirement alignment → interface adaptation → operator enhancement → integration testing → performance tuning. This workflow ensures rapid, stable migration to new vLLM releases.

Case Study 1 – MIMO‑Flash‑V2 Adaptation

When the community had not yet released official support for the MIMO‑Flash‑V2 attention pattern, the team used the vllm.general_plugins interface to register the model topology and redirected linear‑layer dependencies to the Kunlun plugin. Within two days the Kunlun team delivered new operators packaged in the kunlun_ops wheel. After pip install kunlun_ops, the model ran end‑to‑end on Kunlun XPU.
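For context, vLLM discovers such plugins through the vllm.general_plugins setuptools entry‑point group; a package exposes a registration function there. The module and function names below are hypothetical stand‑ins, not the actual vLLM‑Kunlun layout:

```toml
# pyproject.toml fragment (illustrative; names are hypothetical)
[project.entry-points."vllm.general_plugins"]
register_mimo_flash_v2 = "vllm_kunlun.models:register_mimo_flash_v2"
```

At startup, vLLM imports and calls every function registered in this group, which is where the model topology and redirected linear‑layer dependencies would be registered.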

Case Study 2 – Qwen 3.5 Performance & Accuracy

Upgrading to vLLM‑Kunlun v0.15.1 enabled the full workflow for the Qwen 3.5 model (run, accuracy debugging, profiling) within two days. Node‑level output matched GPU results at 99.57 % except for a minor deviation in the LogitsProcessor layer caused by a buggy fully‑connected (FC) operator, which was fixed by the Kunlun chip team within one day.

Profiling was performed with the built‑in PyTorch Profiler (as described in the vLLM profiling guide). Two major bottlenecks were identified:

CPU‑GPU synchronization overhead: The causal_conv1d_fn operator repeatedly called cache_indices.cpu() , triggering costly device‑to‑host (D2H) synchronizations. The fix moved the creation of cache_indices_cpu to the model‑pre‑forward stage and performed initialization in the GDN attention metadata builder, eliminating the D2H syncs and dramatically improving throughput.
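The shape of that fix can be sketched as follows (the class and field names are assumptions mirroring the description above, not vLLM internals):

```python
import torch

# Before: calling .cpu() inside every forward step forces a blocking
# device-to-host (D2H) synchronization per step.
def step_before(cache_indices: torch.Tensor) -> torch.Tensor:
    return cache_indices.cpu()              # D2H sync in the hot loop

# After: the CPU copy is built once, up front, in the attention-metadata
# builder (pre-forward stage), so the hot loop never synchronizes.
class GDNAttentionMetadata:                 # hypothetical stand-in
    def __init__(self, cache_indices: torch.Tensor):
        self.cache_indices = cache_indices
        self.cache_indices_cpu = cache_indices.cpu()   # one-time D2H

def step_after(meta: GDNAttentionMetadata) -> torch.Tensor:
    return meta.cache_indices_cpu           # no per-step sync
```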

Excessive memory‑copy time: A boolean‑index assignment (initial_state[~has_initial_state, ...] = 0) in the GDN Cache caused a large number of scattered memory accesses and became the core bottleneck. Replacing it with Kunlun‑specific reshape‑and‑cache operators reduced total time‑to‑first‑token (TTFT) by over 20 % for 4K‑token inputs.
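The access‑pattern difference can be illustrated in plain PyTorch (the production fix used Kunlun‑specific reshape‑and‑cache operators; this sketch only contrasts scattered boolean writes with a contiguous rewrite):

```python
import torch

B, D = 4, 8
initial_state = torch.arange(B * D, dtype=torch.float32).reshape(B, D)
has_initial_state = torch.tensor([True, False, True, False])

# Bottleneck pattern from the GDN cache: boolean advanced indexing
# compiles to scattered element-wise writes.
slow = initial_state.clone()
slow[~has_initial_state, ...] = 0

# One contiguous alternative: a broadcast mask multiply touches memory
# sequentially instead of scattering.
fast = initial_state * has_initial_state.to(initial_state.dtype).view(-1, 1)
```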

Performance Results

After the two optimizations, Qwen 3.5 achieved:

99.57 % node‑level alignment with GPU reference.

TTFT improvement >20 % for 4K‑token prompts.

Elimination of D2H synchronization bubbles in the profiling trace.

Future Outlook

The project will track every major vLLM release, extend support to new generations of Kunlun chips, and contribute optimization and CI/CD experience back to the open‑source community. Expansion of the Model Zoo to cover more front‑line models is also planned.

Repository: https://github.com/baidu/vLLM-Kunlun

Tags: Performance, AI, vLLM, hardware, Inference, Kunlun
Written by Baidu Intelligent Cloud Tech Hub