
How vLLM‑Kunlun Plugin Enabled Two‑Day Adaptation of MiMo Flash V2 on Kunlun P800 XPU

In just two days, Baidu Baige and Kunlun's engineers extended the vLLM‑Kunlun Plugin to overcome asymmetric KV dimensions and integrate SWA+Sink attention, achieving lossless, high‑performance inference of the MiMo Flash V2 model on the Kunlun P800 XPU.

Baidu Intelligent Cloud Tech Hub

MiMo Flash V2 is a 309 billion‑parameter Mixture‑of‑Experts (MoE) model that combines sliding‑window attention with sink tokens (SWA+Sink) and full attention. Its hybrid attention uses asymmetric key/value head dimensions (192 for keys, 128 for values), which prevents direct deployment on standard GPU/XPU kernels that assume matching key and value dimensions.
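One practical consequence of the asymmetric dimensions is that the KV cache can no longer be allocated as a single symmetric tensor. A minimal sketch of the memory arithmetic, assuming hypothetical layer and head counts (only the 192/128 head dimensions come from the article):

```python
# Sketch: per-token KV-cache bytes when key and value head dims differ.
# num_layers and num_kv_heads below are illustrative placeholders, not
# MiMo Flash V2's real configuration.
def kv_cache_bytes_per_token(num_layers, num_kv_heads,
                             k_head_dim, v_head_dim, dtype_bytes=2):
    # Symmetric kernels assume k_head_dim == v_head_dim; here K and V
    # contributions must be sized separately.
    return num_layers * num_kv_heads * (k_head_dim + v_head_dim) * dtype_bytes

asym = kv_cache_bytes_per_token(num_layers=4, num_kv_heads=8,
                                k_head_dim=192, v_head_dim=128)
# A symmetric allocator that padded V up to the key dim would waste memory:
padded = kv_cache_bytes_per_token(num_layers=4, num_kv_heads=8,
                                  k_head_dim=192, v_head_dim=192)
print(asym, padded)  # padded is larger for the same tokens
```

The gap between the two numbers is why a naive "pad V to 192" workaround is unattractive at 309B-parameter scale.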

Adapting the asymmetric KV architecture

The model was only available in a vLLM development branch and not in the stable v0.13.0 release. Using the vLLM‑Kunlun Plugin, the team performed two main modifications:

Inherited QKVParallelLinear from vLLM v0.11.0 and rewrote its core to allocate separate weight tensors for keys and values, thereby resolving the dimension mismatch.

Implemented a custom model class MiMoV2FlashForCausalLM_KUNLUN that connects the high‑level model to the Kunlun P800 XPU kernels, following the latest community model‑graph design.
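The idea behind the rewritten projection can be sketched as follows. This is an illustrative stand-in, not the actual QKVParallelLinear subclass: the class name, head counts, and hidden size are hypothetical, and only the 192/128 split reflects the model.

```python
# Hypothetical sketch of a QKV projection that keeps separate weight
# tensors for K and V so their output dims can differ (not the real
# vLLM-Kunlun implementation).
class AsymmetricQKVLinear:
    def __init__(self, hidden_size, num_q_heads, num_kv_heads,
                 k_head_dim, v_head_dim):
        # Queries must share the key head dim (192) for the QK dot
        # product; values use their own, smaller dim (128).
        self.q_out = num_q_heads * k_head_dim
        self.k_out = num_kv_heads * k_head_dim
        self.v_out = num_kv_heads * v_head_dim
        # A single fused weight assuming one uniform head_dim cannot
        # represent this, so K and V get their own weight tensors.
        self.q_weight_shape = (self.q_out, hidden_size)
        self.k_weight_shape = (self.k_out, hidden_size)
        self.v_weight_shape = (self.v_out, hidden_size)

    def split_sizes(self):
        # Where to split a concatenated QKV output back into q, k, v.
        return (self.q_out, self.k_out, self.v_out)

proj = AsymmetricQKVLinear(hidden_size=4096, num_q_heads=32,
                           num_kv_heads=8, k_head_dim=192, v_head_dim=128)
print(proj.split_sizes())  # (6144, 1536, 1024)
```

Keeping the three shapes explicit is also what makes weight loading straightforward: each checkpoint tensor maps to exactly one of the three weight slots.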

These changes required only minor interface adjustments and weight‑loading optimizations, demonstrating the plugin’s extensibility. For models already supported in official vLLM releases, only lightweight fine‑tuning is needed.

Performance and precision optimization

In parallel with the framework adaptation, the team extended the existing SWA attention operator with a Sink function, completing the work within the two days. They relied on two diagnostic tools:

torch_xray, which automatically compares layer‑wise outputs between the GPU and the P800 to identify numerical deviations.

xpu_profiler, which generates detailed operator timelines to locate performance bottlenecks.
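The layer-wise comparison that torch_xray automates can be illustrated with a simple sketch. This is not torch_xray's actual API, only the principle; real tools compare tensors, while plain floats keep the example self-contained.

```python
# Sketch of layer-wise output comparison between a reference device
# (GPU) and a device under test (P800).
def compare_layerwise(ref_outputs, test_outputs, atol=1e-3):
    """Return (layer_index, max_abs_diff) for each layer exceeding atol."""
    deviations = []
    for i, (ref, test) in enumerate(zip(ref_outputs, test_outputs)):
        max_diff = max(abs(r - t) for r, t in zip(ref, test))
        if max_diff > atol:
            deviations.append((i, max_diff))
    return deviations

gpu = [[0.10, 0.20], [0.30, 0.40]]
p800 = [[0.10, 0.20], [0.30, 0.45]]   # layer 1 deviates
print(compare_layerwise(gpu, p800))   # reports only layer 1
```

Pinpointing the first deviating layer is what turns "accuracy differs" into a concrete kernel to fix.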

After alignment, inference on the P800 matched GPU accuracy exactly, providing a lossless, high‑efficiency inference path.

Future‑ready Multi‑Token Prediction (MTP) support

The new SWA+Sink operator supports a query length up to 32 during decoding, far exceeding MiMo Flash V2’s current requirement (N=3). Once community MTP issues are resolved, the P800 can immediately leverage this accelerated path.
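The attention pattern the operator implements can be sketched as a boolean mask: each query attends to a fixed number of sink tokens at the start of the sequence plus a causal sliding window of recent tokens, and the decode path accepts a query block of up to 32 positions. The window and sink sizes below are illustrative, not the model's real values.

```python
# Sketch of an SWA+Sink mask for a decode step with query length q_len
# (q_len <= 32 per the operator's limit). Queries are the last q_len
# positions of a sequence of total length seq_len.
def swa_sink_mask(seq_len, q_len, window, num_sink):
    assert q_len <= 32, "operator supports query length up to 32"
    mask = []
    for qi in range(seq_len - q_len, seq_len):   # absolute query positions
        row = []
        for ki in range(seq_len):                # absolute key positions
            causal = ki <= qi
            in_window = qi - ki < window
            is_sink = ki < num_sink
            row.append(causal and (in_window or is_sink))
        mask.append(row)
    return mask

# One decoded query at position 9 of a 10-token sequence,
# with a window of 4 and 2 sink tokens:
m = swa_sink_mask(seq_len=10, q_len=1, window=4, num_sink=2)
print(m[0])  # sinks 0-1 and window positions 6-9 are visible
```

Because the mask is defined per query position, the same function covers both single-token decoding (q_len=1) and a speculative MTP block (e.g. q_len=3, up to 32).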

Conclusion

The entire workflow—from architectural adaptation to performance‑precision tuning—was completed in two days, proving that the vLLM‑Kunlun Plugin can rapidly bridge open‑source large‑model ecosystems with Chinese XPU hardware. The plugin follows the vLLM RFC #11162 hardware‑plugin standard and already supports more than 20 mainstream and multimodal model families (e.g., Qwen, DeepSeek, Llama). Repository: https://github.com/baidu/vLLM-Kunlun

Tags: Performance optimization, vLLM, XPU, Model Adaptation, Hybrid attention, Kunlun P800, MiMo Flash V2