Configuring vLLM swap_space and cpu_offload_gb for Stable Large-Model Inference

The article explains vLLM’s GPU compute capability requirement, describes the swap_space and cpu_offload_gb parameters, outlines their ideal usage scenarios, and provides step‑by‑step code examples that demonstrate how adjusting these settings enables loading and running a 7B‑parameter model on a 16 GB T4 GPU.

vLLM GPU Requirement

vLLM only runs on GPUs whose Compute Capability is 7.0 or higher. Using a GPU with a lower version triggers the error:

CUDA error: no kernel image is available for execution on the device
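
Before launching vLLM, it is worth confirming that the GPU meets this requirement. A minimal check with PyTorch (which vLLM already depends on):

import torch

# Query the Compute Capability of GPU 0 (a T4 reports (7, 5))
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute Capability: {major}.{minor}")
if (major, minor) < (7, 0):
    print("GPU is below vLLM's minimum Compute Capability of 7.0")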

swap_space Parameter

swap_space specifies the amount of CPU memory (GiB) that each GPU can use as swap space. It is needed when the best_of sampling parameter is greater than 1, because the model must temporarily store multiple candidate outputs. If best_of is always 1, swap_space can be set to 0. Setting it too low may cause out‑of‑memory (OOM) errors.

Typical scenarios for enabling swap_space (a configuration sketch follows this list):

Multiple-request scenario: when handling several concurrent requests with best_of > 1, the model keeps several candidate outputs in memory; a properly sized swap_space avoids OOM.

Large-model inference: complex models consume a lot of GPU memory; increasing swap_space provides temporary storage for model state.

Dynamic request handling: when request volume varies, a well-tuned swap_space improves adaptability and reduces memory-related failures.
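
A minimal configuration sketch for the multiple-request case; facebook/opt-125m stands in here as a small test model, and the swap_space value is an assumption to adjust per workload:

from vllm import LLM, SamplingParams

# Reserve 4 GiB of CPU swap per GPU for candidate-sequence state
llm = LLM(model="facebook/opt-125m", swap_space=4)

# best_of > 1 keeps several candidate sequences alive per request
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, best_of=3)
outputs = llm.generate(["The capital of France is"], sampling_params)
print(outputs[0].outputs[0].text)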

cpu_offload_gb Parameter

cpu_offload_gb defines how many gigabytes of CPU memory are used to offload model weights. Offloading frees GPU memory but introduces a data‑transfer cost for each forward pass.

Use cases for cpu_offload_gb:

Large-model training and inference: when model weights cannot fully fit into GPU memory, part of them can be moved to CPU memory, freeing GPU space for other operations.

Reducing GPU memory pressure: applications that run multiple large models or handle high concurrency can lower GPU memory usage and improve overall performance.

Latency-sensitive scenarios: a carefully balanced cpu_offload_gb keeps request latency acceptable despite the added CPU-GPU transfer overhead.

vLLM’s offload technology moves objects (weights or KV cache) to external resources such as CPU memory or NVMe/Disk, reducing the memory load on the primary GPU.
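
A rough starting value for cpu_offload_gb can be derived by comparing the FP16 weight footprint with the available VRAM. The sketch below is plain back-of-envelope arithmetic, not a vLLM API, and the head-room figure is an assumption:

# Back-of-envelope sizing for cpu_offload_gb (all numbers approximate)
params_billion = 7         # e.g. a 7B model
bytes_per_param = 2        # FP16
weights_gib = params_billion * 1e9 * bytes_per_param / 2**30  # ~13 GiB

gpu_gib = 16               # e.g. NVIDIA T4
kv_headroom_gib = 4        # assumed room for KV cache and activations

offload_gib = max(0.0, weights_gib + kv_headroom_gib - gpu_gib)
print(f"weights ~{weights_gib:.1f} GiB, starting cpu_offload_gb ~{offload_gib:.0f} GiB")

Treat the result as a lower bound; the tests below use a more generous 10 GiB so that most of the T4's VRAM stays free for the KV cache.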

Practical Test Cases

Test 1: Load Qwen/Qwen2.5-7B-Instruct, which needs roughly 20-24 GB of VRAM in FP16: the weights alone occupy about 14 GiB, before the KV cache that vLLM pre-allocates on top. A T4 GPU with 16 GB therefore cannot load the model by default.

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Attempt without offloading
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", dtype="float16", swap_space=0, cpu_offload_gb=0)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")

Result: Model fails to load.

Test 2: Enable cpu_offload_gb=10 GiB.

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", dtype="float16", swap_space=0, cpu_offload_gb=10)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")

Result: Model loads successfully, confirming that part of the model state is offloaded to CPU memory.
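
One way to sanity-check the offload is to watch the process's resident set size around model load. This sketch assumes the psutil package is installed and that the engine runs in the same process; some vLLM versions run the engine in a separate worker, whose memory would have to be inspected instead:

import psutil
from vllm import LLM

process = psutil.Process()
rss_before = process.memory_info().rss / 2**30

# With offloading enabled, host RSS should grow by roughly cpu_offload_gb
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", dtype="float16",
          swap_space=0, cpu_offload_gb=10)

rss_after = process.memory_info().rss / 2**30
print(f"Host RSS grew by ~{rss_after - rss_before:.1f} GiB")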

Test 3: Use best_of=5 without swap space.

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, best_of=5)
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", dtype="float16", swap_space=0, cpu_offload_gb=10)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")

Result: Generation succeeds, showing that cpu_offload_gb alone can handle the increased memory demand of multiple candidates.
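
Note that best_of=5 with the default n=1 returns only the single highest-scoring completion. To inspect several of the candidates this extra memory pays for, n can be raised as well; a sketch reusing the llm and prompts from above (cumulative_logprob is the score vLLM attaches to each completion):

# Return the 2 best of 5 sampled candidates per prompt
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, n=2, best_of=5)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    for candidate in output.outputs:
        print(f"score={candidate.cumulative_logprob:.2f}: {candidate.text!r}")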

Test 4: Enable both swap_space=5 GiB and cpu_offload_gb=10 GiB with best_of=5, and measure execution time.

import time
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, best_of=5)
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", dtype="float16", swap_space=5, cpu_offload_gb=10)
start_time = time.time()
outputs = llm.generate(prompts, sampling_params)
execution_time = time.time() - start_time
print(f"llm generate time: {execution_time:.6f} seconds")
for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")

This configuration demonstrates that a combination of swap space and CPU offloading can successfully run a 7B‑parameter model on a 16 GB GPU while providing measurable performance data.
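
The wall-clock time from Test 4 also yields a simple throughput figure. A short addition to the script above, using the token_ids that vLLM returns with each completion:

# Derive decode throughput from the measured wall-clock time
generated = sum(len(c.token_ids) for output in outputs for c in output.outputs)
print(f"~{generated / execution_time:.1f} generated tokens/sec")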
