Deploying GLM-5 on Baidu Kunlun P800 XPU with vLLM‑Kunlun Plugin

This article explains how Zhipu's new GLM-5 large model was adapted to the Baidu Kunlun P800 XPU, covering the asynchronous reinforcement-learning framework Slime, optimization techniques such as INT8 quantization and tensor parallelism, and step-by-step deployment commands using the open-source vLLM-Kunlun plugin.

Baidu Intelligent Cloud Tech Hub

GLM‑5 Release and Kunlun P800 XPU Adaptation

Zhipu released the large language model GLM‑5. Baidu Baige completed Day‑0 adaptation of GLM‑5 to the Kunlun P800 XPU, enabling inference with frameworks such as vLLM and SGLang.

GLM‑5 shows strong performance on complex system‑engineering and long‑range agent tasks, achieving open‑source state‑of‑the‑art results in coding and agent benchmarks, with capabilities comparable to Claude Opus 4.5. Its post‑training RL pipeline is driven by the asynchronous reinforcement‑learning framework Slime, which improves training efficiency and enables continuous learning from long‑horizon interactions, while larger‑scale pre‑training boosts its general intelligence.

Kunlun Optimizations

Baidu Baige leveraged high‑performance Kunlun operators to adapt GLM‑5’s DSA and Mixture‑of‑Experts (MoE) layers. Optimizations applied include INT8 quantization, Multi‑Token Prediction (MTP) speculative decoding, and dual‑machine pipeline parallelism, which together significantly increase inference throughput on Kunlun clusters.
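To make the dual‑machine layout concrete, the sketch below shows how such a deployment might be expressed using upstream vLLM flag conventions. This is an illustration only: the exact flag spellings, MTP configuration, and multi‑node setup in the vLLM‑Kunlun plugin may differ, so verify every option against the plugin documentation before use.

```shell
# Sketch only: flags follow upstream vLLM conventions and may differ in
# the vLLM-Kunlun plugin; verify against the plugin documentation.
# Run on each of the two nodes after joining them to a shared Ray cluster:
python -m vllm.entrypoints.openai.api_server \
      --model /GLM-5-W8A8-INT8-Dynamic \
      --tensor-parallel-size 8 \
      --pipeline-parallel-size 2 \
      --distributed-executor-backend ray \
      --served-model-name glm-5-w8a8
```

With 8‑way tensor parallelism per node and 2 pipeline stages, the model's layers are partitioned across the two machines while each layer's weights are sharded across the 8 XPUs of a node.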

vLLM‑Kunlun Plugin

The open‑source vLLM‑Kunlun Plugin provides the adaptation code for GLM‑5. Developers can find the adaptation changes in the pull request at https://github.com/baidu/vLLM-Kunlun/pull/194 and start the model server with the following configuration:

export XPU_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -m vllm.entrypoints.openai.api_server \
      --model /GLM-5-W8A8-INT8-Dynamic \
      --tensor-parallel-size 8 \
      --max-num-batched-tokens 8192 \
      --block-size 64 \
      --distributed-executor-backend mp \
      --served-model-name glm-5-w8a8
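Once the server is up, it exposes vLLM's OpenAI‑compatible API. A smoke‑test request could look like the following; the port (vLLM's default of 8000) and the prompt are illustrative, and the model name must match the --served-model-name used at launch.

```shell
# Send a chat completion request to the local vLLM server.
# Port 8000 is vLLM's default; adjust if --port was overridden.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "glm-5-w8a8",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 64
      }'
```

The response is a standard OpenAI‑style JSON body, so existing OpenAI client SDKs can be pointed at this endpoint by changing only the base URL.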

Deployment Landscape

To date, Baidu Baige has deployed major models—including GLM, DeepSeek, Qwen, MiMo V2, and Kimi—on Kunlun XPU, aiming to make “model‑as‑a‑service” a normal practice and to let developers experience the acceleration benefits of domestic AI chips immediately after launch.

Additional tools such as torch_xray and xpu_profiler are provided for inference accuracy alignment and performance bottleneck analysis, shortening development cycles.

Future Compute Expansion

February 2025: Baidu Cloud activated a self‑developed 10,000‑card AI cluster based on Kunlun P800, the first domestically launched ten‑thousand‑card AI cluster.

April 2025: The cluster was expanded to 32,000 cards, supporting large‑scale training tasks for Baidu’s Qianfan and Steam Engine models.

April 2025: Baidu released a Tianchi super‑node solution with a 32‑card fully‑meshed architecture, achieving 1.5 µs inter‑card latency and low per‑token cost.

Conclusion

The efficient collaboration between GLM‑5 and Kunlun XPU demonstrates rapid convergence of domestic large models and proprietary compute ecosystems. Ongoing efforts focus on open‑sourcing toolchains, deepening hardware‑software co‑optimization, and partnering with model providers to enable scalable, sustainable AI deployments.

Tags: model deployment, vLLM, reinforcement learning, AI acceleration, INT8 quantization, Kunlun XPU, GLM-5