Accelerating GLM‑4.x Inference on Kunlun XPU with SGLang & vLLM

Baidu’s Baige team successfully adapted the GLM‑4.x series language models to the Kunlun XPU platform by leveraging SGLang and the vLLM‑Kunlun plugin, employing agile adaptation, precision alignment with torch_xray, and extensive performance tuning to achieve GPU‑level accuracy and superior inference speed.

Baidu Intelligent Cloud Tech Hub

Baidu’s Baige team adapted the GLM‑4.x series language models to the Kunlun XPU platform using the SGLang codebase and the vLLM‑Kunlun plugin, enabling fast deployment for enterprise users while maintaining high inference quality.

GLM‑4.x adaptation diagram

Agile Adaptation

Leveraging Kunlun XPU’s high‑performance operator library (Flash Attention, Page Attention, Fused MoE, etc.), the team built the vLLM‑Kunlun plugin that decouples the XPU backend from the vLLM core, allowing seamless migration from GPU to XPU and automatic synchronization with the latest vLLM community releases.
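The decoupling described above can be pictured as a backend registry: the engine core looks up hardware-specific operators by name rather than importing them directly. The sketch below is illustrative only; the class and function names are hypothetical, not the actual vLLM plugin interface.

```python
# Minimal sketch of a plugin-style backend registry, in the spirit of
# the vLLM-Kunlun decoupling described above. All names here are
# illustrative assumptions, not the real vLLM or Kunlun API.

BACKENDS = {}

def register_backend(name):
    """Decorator that registers a hardware backend under a name."""
    def wrap(cls):
        BACKENDS[name] = cls
        return cls
    return wrap

@register_backend("xpu")
class XPUBackend:
    """Stand-in for an XPU backend exposing its operator library."""
    def attention_op(self):
        return "flash_attention_xpu"

def get_backend(name):
    # The engine core only performs this lookup and never touches
    # XPU-specific code, so it can track upstream vLLM releases
    # without modification.
    return BACKENDS[name]()
```

Because the core never imports backend code directly, adding a new accelerator is a matter of registering one more class, which is what allows automatic synchronization with upstream releases.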

Precision Alignment

The torch_xray tool was used for layer‑wise and operator‑wise numeric alignment of GLM‑4.x inference on XPU. When a precision anomaly was detected, developers quickly traced the cause in the code, fixing it to ensure that XPU outputs match GPU results; the same tool also helped locate new precision issues introduced during performance tuning.
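The layer-wise alignment workflow amounts to capturing each layer's output on both devices and flagging the first layer whose numeric difference exceeds a tolerance. Since torch_xray's actual API is not shown in the article, the following is a hedged, self-contained sketch of the comparison logic only.

```python
# Illustrative sketch of layer-wise numeric alignment in the spirit of
# torch_xray (whose real API is not documented here): given per-layer
# outputs captured on GPU and XPU, report the first diverging layer.

def first_mismatch(gpu_outputs, xpu_outputs, atol=1e-3):
    """Return the index of the first layer whose max absolute
    difference exceeds `atol`, or None if all layers align."""
    for i, (g, x) in enumerate(zip(gpu_outputs, xpu_outputs)):
        max_diff = max(abs(a - b) for a, b in zip(g, x))
        if max_diff > atol:
            return i  # precision anomaly starts at this layer
    return None
```

Pinpointing the first diverging layer is what lets developers trace an anomaly to a specific operator rather than debugging end-to-end logits, and re-running the same check after each performance optimization catches newly introduced precision issues.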

Performance Tuning

Using the XPU‑profiler, the team generated execution timelines and compared them with GPU baselines. Combined with model code analysis, they performed systematic optimizations across three dimensions:

Operator level: identified hot operators from profiling, created fused operators (e.g., moe_ffn_block, moe_gate_ops) to reduce kernel launches and TTFT, optimized Fused MoE to lower memory‑bandwidth usage, and introduced specialized Prefill_attention and Decode_attention operators for long‑sequence parallelism and low‑latency token decoding.
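To see why fusing the gating path reduces kernel launches, consider that unfused MoE gating issues separate kernels for softmax, top‑k selection, and weight renormalization. A fused operator like the moe_gate_ops mentioned above does all three in one launch. The pure-Python sketch below shows the fused computation only; it is a conceptual model, not the Kunlun kernel.

```python
# Toy model of fused MoE gating: softmax + top-k + renormalization in
# a single pass, as opposed to three separate kernel launches. This is
# an illustration of the fusion idea, not the actual moe_gate_ops kernel.
import math

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def fused_moe_gate(logits, top_k=2):
    """One pass: expert probabilities, top-k selection, and
    renormalized routing weights."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    experts = ranked[:top_k]
    total = sum(probs[i] for i in experts)
    weights = [probs[i] / total for i in experts]
    return experts, weights
```

On real hardware the win comes from issuing one kernel instead of three and keeping intermediates in on-chip memory, which is also how the fused operator lowers memory-bandwidth usage.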

Compute execution level: employed XPU Graph to record and reuse static computation graphs, minimizing CPU‑induced execution bubbles and increasing effective XPU compute time.
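The record-and-replay idea behind XPU Graph can be sketched in a few lines: ops are registered once at capture time, and each subsequent step replays the whole recorded sequence with a single host call instead of per-op dispatch. The class below is a hypothetical stand-in; the real XPU Graph API is not shown in the article.

```python
# Conceptual sketch of static graph capture/replay, analogous to the
# XPU Graph mechanism described above. The class name and methods are
# hypothetical illustrations, not the Kunlun runtime API.

class StaticGraph:
    """Records a sequence of ops once, then replays it as one unit."""
    def __init__(self):
        self.ops = []

    def capture(self, build_fn):
        # build_fn registers every op exactly once; after capture the
        # host never re-dispatches individual ops step by step.
        build_fn(self)
        return self

    def add(self, op):
        self.ops.append(op)

    def replay(self, x):
        # One host call replays the whole recorded graph, avoiding the
        # per-op launch overhead that creates CPU-side execution bubbles.
        for op in self.ops:
            x = op(x)
        return x
```

Because decode steps of a static-shape model run the same op sequence every iteration, replaying a captured graph converts many small CPU-bound launches into one, raising effective accelerator utilization.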

Communication level: added an adaptive communication‑threshold mechanism that switches All‑Reduce algorithms—using butterfly for traffic below 2 MB and ring for larger traffic—to achieve optimal bandwidth utilization in both small‑ and large‑scale scenarios.
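The adaptive threshold itself is a simple size-based dispatch. The 2 MB cutoff comes from the article; the function below is an illustrative sketch, not Kunlun's actual communication runtime.

```python
# Sketch of the adaptive All-Reduce algorithm selection described
# above. The 2 MB threshold is from the article; the function name is
# an illustrative assumption.

THRESHOLD_BYTES = 2 * 1024 * 1024  # 2 MB, per the article

def pick_allreduce_algo(message_bytes: int) -> str:
    """Butterfly favors latency on small messages; ring favors
    bandwidth utilization on large ones."""
    return "butterfly" if message_bytes < THRESHOLD_BYTES else "ring"
```

Small messages are latency-bound, where butterfly's fewer communication rounds win; large messages are bandwidth-bound, where ring's full link utilization wins, so switching at a measured threshold covers both regimes.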

Through dual‑framework adaptation, a plugin‑driven iterative process, and full‑stack system optimizations, the GLM‑4.x series delivers inference accuracy identical to GPU while achieving stable, high‑performance execution on Kunlun XPU, fully exploiting the hardware’s capabilities.

The Baige team now possesses a reusable adaptation framework for large‑model inference on XPU, supporting rapid deployment across multiple frameworks and meeting performance and cost targets for enterprise AI applications in the domestic market.

Performance optimization flow
Tags: AI, Large Language Models, Model Inference, Hardware Acceleration, XPU
Written by

Baidu Intelligent Cloud Tech Hub

We share the cloud tech topics you care about. Feel free to leave a message and tell us what you'd like to learn.
