Baidu Intelligent Cloud Tech Hub
Author

We share the cloud tech topics you care about. Feel free to leave a message and tell us what you'd like to learn.

137 articles · 0 likes · 314 views · 0 comments
Recent Articles

Latest from Baidu Intelligent Cloud Tech Hub

100 recent articles max
Mar 23, 2026 · Artificial Intelligence

How vLLM‑Kunlun Unlocks Peak LLM Performance on Kunlun XPU

This article details the technical challenges of adapting the open‑source vLLM inference framework to Baidu's Kunlun XPU, outlines four major performance bottlenecks, and presents a multi‑dimensional optimization roadmap. The roadmap combines custom plugins, operator fusion, INT8 quantization, and CUDA‑Graph techniques, which together boost throughput by up to 8% and narrow the gap with leading GPU hardware.

CUDA Graph · INT8 Quantization · Kunlun XPU
0 likes · 13 min read
Mar 18, 2026 · Artificial Intelligence

How vLLM‑Kunlun Brings CUDA‑Like Inference to Kunlun XPU: Architecture, Adaptation, and Performance Wins

This article details the vLLM‑Kunlun open‑source project, which adapts the high‑performance vLLM inference engine to Baidu's Kunlun XPU. It covers the platform overview, the model‑porting workflow, the plugin architecture, concrete case studies with MIMO‑Flash‑V2 and Qwen 3.5, and the performance‑tuning techniques that enable seamless, GPU‑level inference on China's domestically developed hardware.

AI · Kunlun · Performance
0 likes · 12 min read
Mar 6, 2026 · Artificial Intelligence

How Baidu’s End‑to‑End Quantization Stack Supercharges Large‑Model Inference on Kunlun XPU

Baidu Baige built a full‑stack quantization pipeline that integrates model‑level, framework‑level, and hardware‑level optimizations on the Kunlun XPU platform, enabling FP16/BF16 large models to be compressed to 25‑50% of their original size while boosting inference speed by 30‑50% and dramatically reducing memory consumption for enterprise deployments.

AI inference · INT4 · INT8
0 likes · 16 min read
Feb 12, 2026 · Artificial Intelligence

Deploying GLM-5 on Baidu Kunlun P800 XPU with vLLM‑Kunlun Plugin

This article explains how the new GLM-5 large model is adapted to Baidu's Kunlun P800 XPU. It describes the asynchronous reinforcement learning framework Slime, covers optimization techniques such as INT8 quantization and tensor parallelism, and provides step‑by‑step deployment commands using the open‑source vLLM‑Kunlun plugin.

AI acceleration · GLM-5 · INT8 Quantization
0 likes · 6 min read
Feb 6, 2026 · Artificial Intelligence

Accelerating GLM‑4.x Inference on Kunlun XPU with SGLang & vLLM

Baidu’s Baige team successfully adapted the GLM‑4.x series language models to the Kunlun XPU platform by leveraging SGLang and the vLLM‑Kunlun plugin, employing agile adaptation, precision alignment with torch_xray, and extensive performance tuning to achieve GPU‑level accuracy and superior inference speed.

AI · XPU · hardware acceleration
0 likes · 6 min read
Jan 27, 2026 · Artificial Intelligence

Deploying Qwen3 on Kunlun P800: Full‑Parameter DPO Training and Inference Guide

This guide walks through setting up a Kunlun P800 XPU host, preparing Docker containers, deploying Qwen3‑8B/‑32B/‑VL models with vLLM‑Kunlun, benchmarking performance, and running full‑parameter DPO training using LLaMA‑Factory, providing scripts, configuration files, and troubleshooting tips for AI engineers.

DPO · Kunlun P800 · LLaMA-Factory
0 likes · 32 min read
Jan 20, 2026 · Artificial Intelligence

How LoongFlow Enables Expert‑Level AI Agents to Outperform Human Mathematicians

LoongFlow is an open‑source AI agent framework that combines a Plan‑Execute‑Summarize (PES) paradigm with a Hybrid Evolutionary Memory system, allowing agents to perform directed, iterative problem solving and achieve state‑of‑the‑art results on mathematical challenges, Kaggle‑style benchmarks, and real‑world tasks with dramatically higher efficiency.

Benchmarking · Evolutionary Algorithms · LoongFlow
0 likes · 15 min read
Jan 16, 2026 · Artificial Intelligence

How LoongFlow Empowers Expert‑Level AI Agents to Surpass Human Mathematicians

LoongFlow is an open‑source AI agent framework that combines a Plan‑Execute‑Summarize (PES) paradigm with a Hybrid Evolutionary Memory system to enable agents to perform long‑range, complex reasoning, achieving record‑breaking results on mathematical challenges and real‑world ML benchmarks while dramatically improving efficiency.

LoongFlow · Machine learning benchmarks · PES paradigm
0 likes · 15 min read
Jan 12, 2026 · Artificial Intelligence

How to Reduce Large‑Model Inference Cold‑Start to Seconds with vLLM Optimizations

This article details how Baidu Cloud's hybrid‑cloud team leveraged the vLLM framework to cut the cold‑start time of massive models like Qwen3‑235B‑A22B from minutes to a few seconds through accelerated weight loading, CUDA‑graph capture postponement, cross‑instance state reuse, fork‑based process startup, and guard‑instance pre‑warming techniques.

CUDA Graph · cold-start optimization · large-model inference
0 likes · 16 min read