Baidu Intelligent Cloud Tech Hub
Jan 12, 2026 · Artificial Intelligence
How to Reduce Large‑Model Inference Cold‑Start to Seconds with vLLM Optimizations
This article details how Baidu Cloud's hybrid‑cloud team leveraged the vLLM framework to cut the cold‑start time of massive models such as Qwen3‑235B‑A22B from minutes to seconds, using five techniques: accelerated weight loading, deferred CUDA‑graph capture, cross‑instance state reuse, fork‑based process startup, and guard‑instance pre‑warming.
CUDA Graph · cold-start optimization · large-model inference
16 min read
