Accelerating GLM‑4.x Inference on Kunlun XPU with SGLang & vLLM
Baidu’s Baige team adapted the GLM‑4.x series language models to the Kunlun XPU platform using the SGLang codebase and the vLLM‑Kunlun plugin. Through agile adaptation, precision alignment with torch_xray, and extensive performance tuning, the port reaches GPU‑level accuracy with strong inference speed, enabling fast, high‑quality deployment for enterprise users.
Agile Adaptation
Leveraging Kunlun XPU’s high‑performance operator library (Flash Attention, Page Attention, Fused MoE, etc.), the team built the vLLM‑Kunlun plugin, which decouples the XPU backend from the vLLM core. This allows seamless migration from GPU to XPU and keeps the plugin automatically in sync with the latest vLLM community releases.
Precision Alignment
The torch_xray tool was used for layer‑wise and operator‑wise numeric alignment of GLM‑4.x inference on XPU. When a precision anomaly was detected, developers quickly traced the cause in the code, fixing it to ensure that XPU outputs match GPU results; the same tool also helped locate new precision issues introduced during performance tuning.
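The layer‑wise alignment workflow above can be sketched as a tolerance comparison between per‑layer outputs dumped on GPU and on XPU, flagging the first layer that diverges. The function and data below are illustrative only; torch_xray's actual API is not shown here.

```python
# Hedged sketch of layer-wise precision alignment in the spirit of the
# torch_xray workflow (names and data are illustrative, not the tool's API).

def first_divergent_layer(gpu_dump, xpu_dump, rtol=1e-3, atol=1e-5):
    """gpu_dump / xpu_dump: {layer_name: list of floats}, in forward order.
    Returns (layer_name, max_abs_err) for the first mismatching layer,
    or None if every layer is within tolerance."""
    for name, ref in gpu_dump.items():
        got = xpu_dump[name]
        # torch.allclose-style criterion: |a - b| <= atol + rtol * |b|
        if any(abs(a - b) > atol + rtol * abs(b) for a, b in zip(ref, got)):
            max_err = max(abs(a - b) for a, b in zip(ref, got))
            return name, max_err
    return None

gpu = {"embed": [0.10, 0.20], "attn.0": [1.00, -2.00], "mlp.0": [0.50, 0.75]}
xpu = {"embed": [0.10, 0.20], "attn.0": [1.00, -2.00], "mlp.0": [0.50, 0.90]}
print(first_divergent_layer(gpu, xpu))  # flags "mlp.0" as the first mismatch
```

Pinpointing the first divergent layer rather than only the final logits is what lets developers trace an anomaly to a specific operator, both during the initial port and after each performance optimization.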
Performance Tuning
Using the XPU‑profiler, the team generated execution timelines and compared them with GPU baselines. Combined with model code analysis, they performed systematic optimizations across three dimensions:
Operator level: identified hot operators from profiling, created fused operators (e.g., moe_ffn_block, moe_gate_ops) to reduce kernel launches and time to first token (TTFT), optimized Fused MoE to lower memory‑bandwidth usage, and introduced specialized Prefill_attention and Decode_attention operators for long‑sequence parallelism and low‑latency token decoding.
Compute execution level: employed XPU Graph to record and reuse static computation graphs, minimizing CPU‑induced execution bubbles and increasing effective XPU compute time.
Communication level: added an adaptive communication‑threshold mechanism that switches All‑Reduce algorithms—using butterfly for traffic below 2 MB and ring for larger traffic—to achieve optimal bandwidth utilization in both small‑ and large‑scale scenarios.
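The adaptive communication threshold in the last point reduces to a simple size‑based dispatch. The sketch below captures that decision rule; the function and constant names are illustrative, not the plugin's API.

```python
# Sketch of the adaptive All-Reduce algorithm selection described above:
# butterfly for messages under 2 MB, ring for larger traffic.
# (Names are illustrative, not the vLLM-Kunlun plugin's actual API.)

THRESHOLD_BYTES = 2 * 1024 * 1024  # 2 MB switch point from the tuning work

def select_allreduce_algo(message_bytes: int) -> str:
    """Small messages are latency-bound: butterfly completes in O(log N)
    exchange steps. Large messages are bandwidth-bound: ring keeps every
    link busy and moves near-minimal data per rank."""
    return "butterfly" if message_bytes < THRESHOLD_BYTES else "ring"

print(select_allreduce_algo(256 * 1024))        # 256 KB -> butterfly
print(select_allreduce_algo(16 * 1024 * 1024))  # 16 MB  -> ring
```

Switching on message size this way gives near‑optimal bandwidth utilization at both ends of the spectrum without any per‑call tuning.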
Through dual‑framework adaptation, a plugin‑driven iterative process, and full‑stack system optimizations, the GLM‑4.x series delivers inference accuracy identical to GPU while achieving stable, high‑performance execution on Kunlun XPU, fully exploiting the hardware’s capabilities.
The Baige team now possesses a reusable adaptation framework for large‑model inference on XPU, supporting rapid deployment across multiple frameworks and meeting performance and cost targets for enterprise AI applications in the domestic market.
Baidu Intelligent Cloud Tech Hub
