How to Reduce Large‑Model Inference Cold‑Start to Seconds with vLLM Optimizations

This article details how Baidu Cloud's hybrid‑cloud team leveraged the vLLM framework to cut the cold‑start time of massive models like Qwen3‑235B‑A22B from minutes to a few seconds through accelerated weight loading, CUDA‑graph capture postponement, cross‑instance state reuse, fork‑based process startup, and guard‑instance pre‑warming techniques.

Cold‑Start Overhead Analysis

Traditional Kubernetes pod‑level autoscaling only speeds up container launch by pre‑pulling images or warming nodes, but it cannot optimize the internal steps of a large‑model inference engine such as model weight loading, torch.compile JIT compilation, and CUDA‑graph capture. Consequently, a full cold start can take close to ten minutes, and the latency grows with model size, making rapid scaling and recovery impractical.

The complete cold‑start timeline for a Qwen3‑235B‑A22B service shows that Load Model Weight, torch.compile, CUDA‑graph capture, and Import packages dominate the total time.

Load Model Weight: transfers .safetensors from disk to GPU memory.

torch.compile: JIT‑compiles model code into an efficient execution graph.

CUDA‑graph capture: records GPU kernels to eliminate CPU‑GPU sync overhead.

Import packages: loads dependent libraries into memory.
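
A minimal way to confirm which phases dominate is to wrap each startup step with a wall‑clock timer. The sketch below is illustrative only (the phase names are not vLLM's own, and only the import phase is shown in full):

```python
import time
from contextlib import contextmanager

@contextmanager
def phase_timer(name, report):
    """Record the wall-clock duration of one cold-start phase."""
    start = time.perf_counter()
    yield
    report[name] = time.perf_counter() - start

report = {}
with phase_timer("import_packages", report):
    import vllm  # heavy package imports are the first visible cost

# ...wrap weight loading, torch.compile, and CUDA-graph capture the same way...
print(report)  # maps each phase name to its duration in seconds
```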

The optimization goal is to shorten this cold‑start path without sacrificing inference performance, thereby enabling fast scaling and disaster recovery.

Inference Service Startup Optimizations

1. Cross‑instance Model Weight Loading Acceleration

Weight loading is the biggest bottleneck during scaling. By using NVLink for intra‑node transfers and RDMA for inter‑node transfers, the system bypasses the slow disk → CPU → GPU path and streams the 348 GB of model weights directly to the target GPUs in roughly two seconds.
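
The article does not include the transfer code, but conceptually it resembles a collective broadcast from a warm instance to the new one over NCCL (NVLink within a node, RDMA across nodes). A rough sketch under that assumption, with process-group setup and vLLM's actual worker wiring omitted:

```python
import torch
import torch.distributed as dist

def pull_weights_from_peer(model: torch.nn.Module, src_rank: int = 0) -> None:
    """Stream each parameter GPU-to-GPU from the warm donor instance."""
    for param in model.parameters():
        # NCCL moves the tensor over NVLink / RDMA, bypassing disk -> CPU -> GPU.
        dist.broadcast(param.data, src=src_rank)

# Usage (after dist.init_process_group("nccl", ...) on both instances,
# where rank 0 is an instance that already holds the weights):
# pull_weights_from_peer(new_instance_model, src_rank=0)
```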

2. Cross‑instance Intermediate‑State Reuse

Two categories of intermediate states are identified:

Reusable states (e.g., model_infos, deep_gemm) that depend only on model and engine versions.

Hash‑matched caches (e.g., torch_compile_cache, inductor_cache) that require a consistent environment. The team introduced a consistent‑hash mechanism to achieve near‑100% cache hit rates during scaling; a rough sketch of such a cache key follows.
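
One way to picture the hash matching is a deterministic cache key derived from every factor that must agree before a compiled artifact can be reused. The fields below are assumptions for illustration, not Baidu's actual key:

```python
import hashlib

def compile_cache_key(model_name: str, engine_version: str,
                      torch_version: str, gpu_arch: str) -> str:
    """Hash the factors that must match for torch.compile / Inductor artifacts to be shared."""
    blob = "|".join([model_name, engine_version, torch_version, gpu_arch])
    return hashlib.sha256(blob.encode()).hexdigest()

# Instances that compute the same key can safely pull each other's cached
# artifacts (e.g., torch_compile_cache, inductor_cache) instead of recompiling.
key = compile_cache_key("Qwen3-235B-A22B", "vllm-0.x", "torch-2.x", "sm90")
```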

3. Lazy CUDA‑graph Capture

During initialization, capture only the minimal necessary graph (usually the largest one) to set up the GPU memory pool.

When the first inference request arrives, capture all remaining graphs in one batch.

Subsequent requests reuse the cached graphs directly.

This reduces the torch.compile + CUDA‑graph overhead from 10‑60 seconds to 1‑2 seconds.
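
A simplified sketch of what deferred capture can look like with PyTorch's CUDA‑graph API, assuming a single static input shape per graph; vLLM's real implementation manages many batch sizes, warmup iterations, and a shared memory pool:

```python
import torch

class LazyGraphRunner:
    """Capture the CUDA graph on the first request, replay it afterwards."""

    def __init__(self, model, example_input):
        self.model = model
        self.static_input = example_input.clone()
        self.graph = None
        self.static_output = None

    def _capture(self):
        # Warmup runs on a side stream are omitted for brevity.
        self.graph = torch.cuda.CUDAGraph()
        with torch.no_grad(), torch.cuda.graph(self.graph):
            self.static_output = self.model(self.static_input)

    def __call__(self, x):
        if self.graph is None:
            self._capture()            # pay the capture cost on request #1
        self.static_input.copy_(x)     # later requests only replay the graph
        self.graph.replay()
        return self.static_output
```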

4. Fork‑based Multi‑process Initialization

vLLM normally spawns child processes, which isolates CUDA contexts but adds overhead. By confirming that no CUDA context exists before worker creation, the team safely switched to fork for certain subprocesses, inheriting the already‑loaded Python packages and cutting process‑startup time.
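
A hedged illustration of that check; the helper below is not vLLM's actual startup code:

```python
import multiprocessing as mp
import torch

def start_worker(worker_fn, *args):
    """Fork only if the parent has not created a CUDA context yet."""
    # fork inherits already-imported Python packages (no re-import cost),
    # but is unsafe once a CUDA context exists in the parent process.
    method = "fork" if not torch.cuda.is_initialized() else "spawn"
    ctx = mp.get_context(method)
    proc = ctx.Process(target=worker_fn, args=args)
    proc.start()
    return proc
```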

5. Guard‑Instance Pre‑warming

Guard instances keep a minimal CUDA context in GPU memory while releasing model weights and KV cache. When scaling is needed, the instance is quickly awakened, the accelerated weight transfer loads the model, and the retained context speeds up compilation and graph capture. In tests, this end‑to‑end scaling took only six seconds.
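
An illustrative sleep/wake cycle for a guard instance; the engine hooks (release_weights_and_kv_cache, load_weights_from_peer) are hypothetical placeholders rather than real vLLM APIs:

```python
import gc
import torch

class GuardInstance:
    def __init__(self, engine):
        self.engine = engine

    def sleep(self):
        """Drop weights and KV cache but keep the CUDA context resident."""
        self.engine.release_weights_and_kv_cache()  # hypothetical hook
        gc.collect()
        torch.cuda.empty_cache()  # frees cached memory; the context itself stays alive

    def wake(self, peer_rank: int):
        """Re-populate weights from a warm peer, then rebuild graphs on demand."""
        self.engine.load_weights_from_peer(peer_rank)  # e.g. the broadcast sketch above
        # The retained context plus reused compile caches make the rest fast.
```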

Best‑Practice Scenarios

Scenario 1 – User‑experience‑critical services with idle resources: enable guard‑instance pre‑warming for the fastest startup.

Scenario 2 – Disaster‑recovery or low‑priority scaling without idle resources: disable guard instances and rely on lazy CUDA‑graph capture.

Scenario 3 – Infrequent scaling with limited resources: use cross‑instance weight transfer, state reuse, and fork‑based startup without guard instances or lazy graph capture.
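
These three profiles can be summarized as feature toggles; the flag names below are illustrative only, not actual vLLM or Baidu Cloud options:

```python
SCENARIO_PROFILES = {
    "latency_critical_with_idle_gpus": {      # Scenario 1
        "guard_instance_prewarm": True,
        "lazy_cuda_graph_capture": True,
        "peer_weight_transfer": True,
        "state_reuse": True,
        "fork_start": True,
    },
    "disaster_recovery_no_idle_gpus": {       # Scenario 2
        "guard_instance_prewarm": False,
        "lazy_cuda_graph_capture": True,
        "peer_weight_transfer": True,
        "state_reuse": True,
        "fork_start": True,
    },
    "infrequent_scaling_limited_resources": { # Scenario 3
        "guard_instance_prewarm": False,
        "lazy_cuda_graph_capture": False,
        "peer_weight_transfer": True,
        "state_reuse": True,
        "fork_start": True,
    },
}
```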

Conclusion

By dissecting the inference engine’s cold‑start path and applying targeted optimizations—accelerated weight loading, intermediate‑state reuse, lazy CUDA‑graph capture, fork‑based process creation, and guard‑instance pre‑warming—the team built a flexible, elastic scaling solution for large‑model services. The approach dramatically reduces startup latency across diverse resource conditions while preserving inference throughput, providing a practical reference for building high‑performance, cost‑effective inference infrastructure.

Figure: Cold‑start timeline diagram
Figure: Cross‑instance weight transfer architecture
Figure: Lazy CUDA‑graph capture flow
Tags: large-model inference, vLLM, CUDA Graph, cold-start optimization, scalable deployment
Written by Baidu Intelligent Cloud Tech Hub

We share the cloud tech topics you care about. Feel free to leave a message and tell us what you'd like to learn.
