How to Reduce Large‑Model Inference Cold‑Start to Seconds with vLLM Optimizations

This article details how Baidu Cloud's hybrid‑cloud team leveraged the vLLM framework to cut the cold‑start time of massive models like Qwen3‑235B‑A22B from minutes to a few seconds through accelerated weight loading, CUDA‑graph capture postponement, cross‑instance state reuse, fork‑based process startup, and guard‑instance pre‑warming techniques.

Cold‑Start Overhead Analysis

Traditional Kubernetes pod‑level autoscaling only speeds up container launch by pre‑pulling images or warming nodes, but it cannot optimize the internal steps of a large‑model inference engine such as model weight loading, torch.compile JIT compilation, and CUDA‑graph capture. Consequently, a full cold start can take close to ten minutes, and the latency grows with model size, making rapid scaling and recovery impractical.

The complete cold‑start timeline for a Qwen3‑235B‑A22B service shows that Load Model Weight, torch.compile, CUDA‑graph capture, and Import packages dominate the total time.

Load Model Weight: transfers .safetensors from disk to GPU memory.

torch.compile: JIT‑compiles model code into an efficient execution graph.

CUDA‑graph capture: records GPU kernels to eliminate CPU‑GPU sync overhead.

Import packages: loads dependent libraries into memory.
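
A minimal way to confirm which phases dominate is to wrap each startup step with a wall‑clock timer. The sketch below is illustrative only (the phase names are not vLLM's own, and only the import phase is shown in full):

```python
import time
from contextlib import contextmanager

@contextmanager
def phase_timer(name, report):
    """Record the wall-clock duration of one cold-start phase."""
    start = time.perf_counter()
    yield
    report[name] = time.perf_counter() - start

report = {}
with phase_timer("import_packages", report):
    import vllm  # heavy package imports are the first visible cost

# ...wrap weight loading, torch.compile, and CUDA-graph capture the same way...
print(report)  # maps each phase name to its duration in seconds
```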

The optimization goal is to shorten this cold‑start path without sacrificing inference performance, thereby enabling fast scaling and disaster recovery.

Inference Service Startup Optimizations

1. Cross‑instance Model Weight Loading Acceleration

Weight loading is the biggest bottleneck during scaling. By using NVLink for intra‑node transfers and RDMA for inter‑node transfers, the system bypasses the slow disk → CPU → GPU path and streams the 348 GB of model weights directly to the target GPUs in roughly two seconds.
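
The article does not include the transfer code, but conceptually it resembles a collective broadcast from a warm instance to the new one over NCCL (NVLink within a node, RDMA across nodes). A rough sketch under that assumption, with process-group setup and vLLM's actual worker wiring omitted:

```python
import torch
import torch.distributed as dist

def pull_weights_from_peer(model: torch.nn.Module, src_rank: int = 0) -> None:
    """Stream each parameter GPU-to-GPU from the warm donor instance."""
    for param in model.parameters():
        # NCCL moves the tensor over NVLink / RDMA, bypassing disk -> CPU -> GPU.
        dist.broadcast(param.data, src=src_rank)

# Usage (after dist.init_process_group("nccl", ...) on both instances,
# where rank 0 is an instance that already holds the weights):
# pull_weights_from_peer(new_instance_model, src_rank=0)
```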

2. Cross‑instance Intermediate‑State Reuse

Two categories of intermediate states are identified:

Reusable states (e.g., model_infos, deep_gemm) that depend only on model and engine versions.

Hash‑matched caches (e.g., torch_compile_cache, inductor_cache) that require a consistent environment. The team introduced a consistent‑hash mechanism to achieve near‑100% cache hit rates during scaling; a rough sketch of such a cache key follows.
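
One way to picture the hash matching is a deterministic cache key derived from every factor that must agree before a compiled artifact can be reused. The fields below are assumptions for illustration, not Baidu's actual key:

```python
import hashlib

def compile_cache_key(model_name: str, engine_version: str,
                      torch_version: str, gpu_arch: str) -> str:
    """Hash the factors that must match for torch.compile / Inductor artifacts to be shared."""
    blob = "|".join([model_name, engine_version, torch_version, gpu_arch])
    return hashlib.sha256(blob.encode()).hexdigest()

# Instances that compute the same key can safely pull each other's cached
# artifacts (e.g., torch_compile_cache, inductor_cache) instead of recompiling.
key = compile_cache_key("Qwen3-235B-A22B", "vllm-0.x", "torch-2.x", "sm90")
```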

3. Lazy CUDA‑graph Capture

During initialization, capture only the minimal necessary graph (usually the largest one) to set up the GPU memory pool.

When the first inference request arrives, capture all remaining graphs in one batch.

Subsequent requests reuse the cached graphs directly.

This reduces the torch.compile + CUDA‑graph overhead from 10‑60 seconds to 1‑2 seconds.
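
A simplified sketch of what deferred capture can look like with PyTorch's CUDA‑graph API, assuming a single static input shape per graph; vLLM's real implementation manages many batch sizes, warmup iterations, and a shared memory pool:

```python
import torch

class LazyGraphRunner:
    """Capture the CUDA graph on the first request, replay it afterwards."""

    def __init__(self, model, example_input):
        self.model = model
        self.static_input = example_input.clone()
        self.graph = None
        self.static_output = None

    def _capture(self):
        # Warmup runs on a side stream are omitted for brevity.
        self.graph = torch.cuda.CUDAGraph()
        with torch.no_grad(), torch.cuda.graph(self.graph):
            self.static_output = self.model(self.static_input)

    def __call__(self, x):
        if self.graph is None:
            self._capture()            # pay the capture cost on request #1
        self.static_input.copy_(x)     # later requests only replay the graph
        self.graph.replay()
        return self.static_output
```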

4. Fork‑based Multi‑process Initialization

vLLM normally spawns child processes, which isolates CUDA contexts but adds overhead. By confirming that no CUDA context exists before worker creation, the team safely switched to fork for certain subprocesses, inheriting the already‑loaded Python packages and cutting process‑startup time.
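
A hedged illustration of that check; the helper below is not vLLM's actual startup code:

```python
import multiprocessing as mp
import torch

def start_worker(worker_fn, *args):
    """Fork only if the parent has not created a CUDA context yet."""
    # fork inherits already-imported Python packages (no re-import cost),
    # but is unsafe once a CUDA context exists in the parent process.
    method = "fork" if not torch.cuda.is_initialized() else "spawn"
    ctx = mp.get_context(method)
    proc = ctx.Process(target=worker_fn, args=args)
    proc.start()
    return proc
```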

5. Guard‑Instance Pre‑warming

Guard instances keep a minimal CUDA context in GPU memory while releasing model weights and KV cache. When scaling is needed, the instance is quickly awakened, the accelerated weight transfer loads the model, and the retained context speeds up compilation and graph capture. In tests, this end‑to‑end scaling took only six seconds.
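
An illustrative sleep/wake cycle for a guard instance; the engine hooks (release_weights_and_kv_cache, load_weights_from_peer) are hypothetical placeholders rather than real vLLM APIs:

```python
import gc
import torch

class GuardInstance:
    def __init__(self, engine):
        self.engine = engine

    def sleep(self):
        """Drop weights and KV cache but keep the CUDA context resident."""
        self.engine.release_weights_and_kv_cache()  # hypothetical hook
        gc.collect()
        torch.cuda.empty_cache()  # frees cached memory; the context itself stays alive

    def wake(self, peer_rank: int):
        """Re-populate weights from a warm peer, then rebuild graphs on demand."""
        self.engine.load_weights_from_peer(peer_rank)  # e.g. the broadcast sketch above
        # The retained context plus reused compile caches make the rest fast.
```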

Best‑Practice Scenarios

Scenario 1 – User‑experience‑critical services with idle resources: enable guard‑instance pre‑warming for the fastest startup.

Scenario 2 – Disaster‑recovery or low‑priority scaling without idle resources: disable guard instances and rely on lazy CUDA‑graph capture.

Scenario 3 – Infrequent scaling with limited resources: use cross‑instance weight transfer, state reuse, and fork‑based startup without guard instances or lazy graph capture.
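
These three profiles can be summarized as feature toggles; the flag names below are illustrative only, not actual vLLM or Baidu Cloud options:

```python
SCENARIO_PROFILES = {
    "latency_critical_with_idle_gpus": {      # Scenario 1
        "guard_instance_prewarm": True,
        "lazy_cuda_graph_capture": True,
        "peer_weight_transfer": True,
        "state_reuse": True,
        "fork_start": True,
    },
    "disaster_recovery_no_idle_gpus": {       # Scenario 2
        "guard_instance_prewarm": False,
        "lazy_cuda_graph_capture": True,
        "peer_weight_transfer": True,
        "state_reuse": True,
        "fork_start": True,
    },
    "infrequent_scaling_limited_resources": { # Scenario 3
        "guard_instance_prewarm": False,
        "lazy_cuda_graph_capture": False,
        "peer_weight_transfer": True,
        "state_reuse": True,
        "fork_start": True,
    },
}
```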

Conclusion

By dissecting the inference engine’s cold‑start path and applying targeted optimizations—accelerated weight loading, intermediate‑state reuse, lazy CUDA‑graph capture, fork‑based process creation, and guard‑instance pre‑warming—the team built a flexible, elastic scaling solution for large‑model services. The approach dramatically reduces startup latency across diverse resource conditions while preserving inference throughput, providing a practical reference for building high‑performance, cost‑effective inference infrastructure.

Figure: Cold‑start timeline diagram
Figure: Cross‑instance weight transfer architecture
Figure: Lazy CUDA‑graph capture flow
Tags: large-model inference, vLLM, CUDA Graph, cold-start optimization, scalable deployment
Written by Baidu Intelligent Cloud Tech Hub

We share the cloud tech topics you care about. Feel free to leave a message and tell us what you'd like to learn.
