How Baidu’s vLLM‑Kunlun Plugin Powered MiMo Flash V2 on Kunlun XPU in 2 Days

Within two days, Baidu’s Baige and Kunlun Chip teams adapted the 309‑billion‑parameter MiMo Flash V2 model, which combines a hybrid SWA+Sink and Full Attention mechanism, to run efficiently on the Kunlun P800 XPU using the vLLM‑Kunlun Plugin, achieving lossless accuracy and performance on par with GPU inference.


Background

MiMo Flash V2, released by Xiaomi, is a 309‑billion‑parameter Mixture‑of‑Experts model that combines SWA+Sink (sliding‑window attention with attention‑sink tokens) and Full Attention layers, a hybrid design that improves inference efficiency by restricting most attention to a sliding window.

Challenge

The hybrid attention architecture uses mismatched Key/Value head dimensions (192 for Key vs. 128 for Value), which the fused QKV path in the current vLLM codebase does not support, and no stable release yet includes the necessary adaptations.

Solution Overview

Using the vLLM‑Kunlun Plugin, the Baidu Baige and Kunlun Chip teams implemented two key modifications in two days to enable end‑to‑end execution of MiMo Flash V2 on the Kunlun P800 XPU.

Implementation Steps

First, they inherited the QKVParallelLinear class from vLLM v0.11.0’s Linear module and overrode its core logic in the plugin to create separate weight tensors for Key and Value, resolving the asymmetric‑dimension issue.
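
The plugin’s actual subclass builds on vLLM’s tensor‑parallel linear layers, but the core idea can be shown with a minimal PyTorch sketch: keep separate projection weights for K and V so their head dimensions need not match. The class name AsymmetricQKVLinear and the default dimensions are illustrative, not the plugin’s real code.

```python
import torch
import torch.nn as nn

class AsymmetricQKVLinear(nn.Module):
    """Illustrative stand-in for the plugin's QKVParallelLinear subclass.

    vLLM's fused QKV projection assumes one shared head size; here K and V
    keep separate weight tensors so the Key head dim (192) and Value head
    dim (128) can differ, mirroring MiMo Flash V2's asymmetric layout.
    """

    def __init__(self, hidden_size: int, num_heads: int, num_kv_heads: int,
                 qk_head_dim: int = 192, v_head_dim: int = 128) -> None:
        super().__init__()
        self.num_heads = num_heads
        self.num_kv_heads = num_kv_heads
        self.qk_head_dim = qk_head_dim
        self.v_head_dim = v_head_dim
        # Separate projections instead of one fused QKV weight tensor.
        self.q_proj = nn.Linear(hidden_size, num_heads * qk_head_dim, bias=False)
        self.k_proj = nn.Linear(hidden_size, num_kv_heads * qk_head_dim, bias=False)
        self.v_proj = nn.Linear(hidden_size, num_kv_heads * v_head_dim, bias=False)

    def forward(self, x: torch.Tensor):
        # x: [num_tokens, hidden_size]
        q = self.q_proj(x).view(-1, self.num_heads, self.qk_head_dim)
        k = self.k_proj(x).view(-1, self.num_kv_heads, self.qk_head_dim)
        v = self.v_proj(x).view(-1, self.num_kv_heads, self.v_head_dim)
        return q, k, v
```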

Second, they created a custom model class, MiMoV2FlashForCausalLM_KUNLUN, that bridges the high‑level model definition to the P800 backend, following the community’s latest model‑graph design.
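
vLLM’s documented out‑of‑tree mechanism lets a plugin register such a class by architecture name via ModelRegistry. A minimal sketch follows; the architecture string and the module path inside the plugin are my assumptions for illustration.

```python
from vllm import ModelRegistry

def register_models():
    # Map the architecture name from the model's config.json to the plugin's
    # implementation class. Both the architecture string and the module path
    # below are assumptions, not the plugin's confirmed layout.
    ModelRegistry.register_model(
        "MiMoV2FlashForCausalLM",
        "vllm_kunlun.models.mimo_v2_flash:MiMoV2FlashForCausalLM_KUNLUN",
    )
```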

Performance Optimization

The team also added a Sink operator to the existing SWA attention kernel and integrated it into the high‑performance operator library. Using torch_xray for precision alignment and xpu_profiler for bottleneck analysis, they verified lossless accuracy and high throughput on the P800.
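
The fused XPU kernel itself is not public, but the masking semantics it must implement are simple to state: each query attends causally to the first few “sink” tokens plus a recent window. A naive PyTorch reference like the following is also the kind of baseline used for precision alignment; the function name and shape conventions are my own, not the plugin’s API.

```python
import torch

def swa_sink_reference(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                       window: int, num_sink: int) -> torch.Tensor:
    """Naive reference for sliding-window attention with sink tokens.

    q, k: [seq_len, num_heads, qk_head_dim]; v: [seq_len, num_heads, v_head_dim]
    (v's head dim may differ from q/k's, as in MiMo Flash V2). Single
    sequence, equal Q/KV head counts, for clarity only.
    """
    seq_len = q.shape[0]
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("qhd,khd->hqk", q, k) * scale  # [heads, query, key]

    pos = torch.arange(seq_len)
    causal = pos[None, :] <= pos[:, None]                # key index <= query index
    in_window = (pos[:, None] - pos[None, :]) < window   # within the sliding window
    is_sink = pos[None, :] < num_sink                    # sink tokens stay visible
    mask = causal & (in_window | is_sink)                # [query, key]

    scores = scores.masked_fill(~mask, float("-inf"))
    probs = scores.softmax(dim=-1)
    return torch.einsum("hqk,khd->qhd", probs, v)        # [seq_len, heads, v_head_dim]
```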

Additional Features

They also pre‑implemented Multi‑Token Prediction (MTP) support, allowing decode‑stage query lengths of up to 32, well beyond the model’s native requirement of 4, so the P800 can leverage accelerated MTP as soon as the community fixes the related upstream issues.
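
If the plugin wires MTP through vLLM’s standard speculative‑decoding configuration, enabling it would look roughly like the sketch below; the method identifier, model path, and speculative‑token count are all assumptions, not documented plugin options.

```python
from vllm import LLM, SamplingParams

# Hypothetical invocation, assuming the plugin exposes MiMo's MTP head
# via vLLM's standard speculative_config. All values here are placeholders.
llm = LLM(
    model="XiaomiMiMo/MiMo-Flash-V2",   # placeholder model path
    tensor_parallel_size=8,
    speculative_config={
        "method": "mtp",                # assumed method identifier
        "num_speculative_tokens": 4,    # the model's native MTP depth per the article
    },
)
print(llm.generate(["Hello, Kunlun!"], SamplingParams(max_tokens=64)))
```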

Conclusion

In just two days, the plugin demonstrated that a complex, asymmetric‑KV model can be adapted to a domestic XPU with no loss of accuracy and with GPU‑level performance, showcasing the rapid co‑evolution of open‑source AI frameworks and Chinese hardware.

For more details and deployment instructions, see the vLLM‑Kunlun GitHub repository: https://github.com/baidu/vLLM-Kunlun

Tags: vLLM, AI inference, Kunlun XPU, MiMo Flash V2