Why Does GPU Memory Keep Growing in DeepSeek‑R1 Inference? Uncovering PyTorch’s Cache

After the full‑precision DeepSeek‑R1 model was deployed on a 2×8‑GPU ACS cluster, repeated stress tests showed GPU memory usage rising steadily and never being released. This article walks through the investigation: it reproduces the behavior, examines vLLM logs and Prometheus metrics, identifies PyTorch’s caching allocator as the root cause, and offers mitigation tips.

Alibaba Cloud Developer

Background

A customer deployed the full‑precision DeepSeek‑R1 671B model on a distributed ACS setup (2×8 GPUs) using vLLM 0.7.2. After multiple stress‑test runs, GPU memory usage kept increasing and never decreased.

Key Questions

Who consumes the additional memory?

Does memory growth have an upper bound?

If it keeps growing, how can the memory be released?

Environment Details

Model: DeepSeek‑R1 671B

GPU: 2×8 GPU RDMA (ACS distributed deployment)

Inference framework: vLLM 0.7.2

Launch command:

vllm serve /data/DeepSeek-R1/ --port 8000 --trust-remote-code --served-model-name ds --max-model-len 8192 --gpu-memory-utilization 0.95 --tensor-parallel-size 16 --enforce-eager --api-key token_xxxx --max-num-seqs 128

Initial Investigation

vLLM logs show per‑GPU memory allocation:

INFO 03-20 19:40:50 worker.py:267] Memory profiling takes 16.56 seconds
INFO 03-20 19:40:50 worker.py:267] the current vLLM instance can use total_gpu_memory (9x.00 GiB) x gpu_memory_utilization (0.95) = 9x.00 GiB
INFO 03-20 19:40:50 worker.py:267] model weights take 42.59GiB; non_torch_memory takes 2.30GiB; PyTorch activation peak memory takes 1.42GiB; the rest of the memory reserved for KV Cache is 4x.00 GiB.

Each pod contains 8 GPUs, so per‑pod memory breakdown is:

Model weights: 42.59 GiB × 8 = 340.72 GiB

Non‑torch memory: 2.30 GiB × 8 = 18.4 GiB

Activation peak: 1.42 GiB × 8 = 11.36 GiB

KV Cache: 4x GiB × 8 ≈ 3xx GiB

Prometheus monitoring matches the pod‑level usage of ~7xx GiB.
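The per‑pod figures follow directly from the per‑GPU log values. A minimal sketch of that arithmetic is shown here, using only the unredacted numbers from the log; the KV cache figure is redacted in the log, so it is left out, and the variable and function names are illustrative.

```python
# Per-GPU values in GiB, copied from the vLLM worker log above.
PER_GPU_GIB = {
    "model weights": 42.59,
    "non-torch memory": 2.30,
    "activation peak": 1.42,
}
GPUS_PER_POD = 8


def per_pod_breakdown(per_gpu, gpus):
    """Scale each per-GPU figure to the whole pod."""
    return {name: round(gib * gpus, 2) for name, gib in per_gpu.items()}


breakdown = per_pod_breakdown(PER_GPU_GIB, GPUS_PER_POD)
for name, gib in breakdown.items():
    print(f"{name}: {gib:.2f} GiB per pod")
print(f"subtotal without KV cache: {sum(breakdown.values()):.2f} GiB")
```

The KV cache share then accounts for the remainder of the pod‑level usage seen in Prometheus.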

Reproducing the Issue

A curl request was sent to the service:

time curl http://172.17.98.24:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "ds",
  "messages": [
    {"role": "system", "content": "You are a friendly AI assistant."},
    {"role": "user", "content": "Could you give me an introduction to deep learning?"}
  ],
  "max_tokens": 2048
}'

Prometheus showed a 2 GiB increase that persisted for over 2 hours.
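One way to confirm such a persistent increase independently of Prometheus is to sample per‑GPU usage before and after a request. The sketch below assumes nvidia-smi is on the PATH of the machine being measured. The helper names are hypothetical and were not part of the original investigation, and the sample numbers are illustrative rather than measured.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=index,memory.used",
         "--format=csv,noheader,nounits"]


def parse_memory_used(csv_text):
    """Parse 'index, used_mib' lines into {gpu_index: used_mib}."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, used = (field.strip() for field in line.split(","))
        usage[int(index)] = int(used)
    return usage


def sample_memory_used():
    """Run nvidia-smi once and return per-GPU memory usage in MiB."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_used(out.stdout)


def growth(before, after):
    """Per-GPU change in MiB between two samples."""
    return {gpu: after[gpu] - before[gpu] for gpu in before}


# Illustrative values only, not measurements from the article.
before = parse_memory_used("0, 1000\n1, 1000\n")
after = parse_memory_used("0, 1246\n1, 1246\n")
print(growth(before, after))
```

In practice, call sample_memory_used() before sending the request and again some time after it completes, then pass both samples to growth().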

KV Cache Check

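KV cache utilization stayed around 0.1, so the KV cache was far from full. Since vLLM sets aside the KV cache at startup (see the log above), KV usage would not be expected to change device memory in any case. Other vLLM metrics confirmed the low KV usage, while the growth itself showed up in the PyTorch allocation trace.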
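KV cache utilization can also be read directly from the server's Prometheus endpoint. The sketch below assumes the vLLM server exposes /metrics and that the gauge is named vllm:gpu_cache_usage_perc, which should be checked against the actual output of the version in use. The parser is a hypothetical helper, and the sample text is illustrative.

```python
import urllib.request

# Assumed metric name; verify it in your server's /metrics output.
METRIC = "vllm:gpu_cache_usage_perc"


def fetch_metrics(base_url):
    """Download the Prometheus text exposition from the server."""
    with urllib.request.urlopen(f"{base_url}/metrics") as resp:
        return resp.read().decode("utf-8")


def parse_gauge(metrics_text, name):
    """Return every sample value of metric `name` from Prometheus text format."""
    values = []
    for line in metrics_text.splitlines():
        if not line.startswith(name):
            continue  # comments (# HELP / # TYPE) and other metrics
        rest = line[len(name):]
        if rest[:1] not in ("{", " "):
            continue  # a different metric that merely shares the prefix
        if rest.startswith("{"):
            rest = rest[rest.rindex("}") + 1:]  # drop the label set
        values.append(float(rest.split()[0]))
    return values


# Illustrative exposition text, not captured from the deployment.
sample = (
    "# TYPE vllm:gpu_cache_usage_perc gauge\n"
    'vllm:gpu_cache_usage_perc{model_name="ds"} 0.1\n'
)
print(parse_gauge(sample, METRIC))
```

Against a live server, parse_gauge(fetch_metrics(base_url), METRIC) returns one value per label set.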

Prometheus Metric Validation

The metric DCGM_FI_DEV_FB_USED from Prometheus aligns with the Memory‑Usage field of nvidia‑smi, confirming the observed values are real.

sum(DCGM_FI_DEV_FB_USED{PodName=~"", NamespaceName=~""}) by (NamespaceName, PodName)

Metric Name | Metric Type | Unit | Description
DCGM_FI_DEV_FB_USED | Gauge | MiB | Frame buffer used; matches nvidia‑smi Memory‑Usage.

Further Diagnosis with Nsight

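An Nsight capture of vLLM execution showed PyTorch calling cudaMalloc for 246 MiB on each GPU. Summed over the 8 GPUs of a pod, that is 246 MiB × 8 = 1968 MiB, or roughly 2 GiB, which matches the growth reproduced above (Phase 2 of the capture). NCCL initialization happens at model load, before the growth appears, so NCCL is not the culprit.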

Blocking Synchronization Test

Setting CUDA_LAUNCH_BLOCKING=1, which makes CUDA kernel launches synchronous, and switching to the much smaller Qwen2.5‑1.5B model reproduced the same memory growth pattern. This confirms that the issue is tied to PyTorch’s memory management rather than to model size.

# Shell: make CUDA kernel launches synchronous
export CUDA_LAUNCH_BLOCKING=1

# Python: sample inference script
from vllm import LLM

# Prompts to generate from
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
model_name = "Qwen/Qwen2.5-1.5B"
llm = LLM(model=model_name, trust_remote_code=True, gpu_memory_utilization=0.95, max_model_len=8192)

# Generate with default sampling parameters and print each completion
outputs = llm.generate(prompts=prompts)
for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")

PyTorch Cache Mechanism Deep Dive

In model_runner.execute_model, a loop was added to print torch.cuda.memory_summary(device_id) for each GPU. After the test, GPUs 0 and 6 showed an increase of about 1346 MiB in the “from large pool” reserved memory.
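The same counters that memory_summary prints are available in machine‑readable form from torch.cuda.memory_stats, which makes before/after comparisons easier. The following is a sketch rather than the instrumentation used in the investigation: the helper names are hypothetical, the torch import is optional so the helpers load without a GPU, and the sample numbers at the end are illustrative.

```python
try:
    import torch  # optional, so the helpers can be loaded without PyTorch
except ImportError:
    torch = None

MIB = 1024 ** 2
KEYS = (
    "allocated_bytes.all.current",
    "reserved_bytes.large_pool.current",
    "reserved_bytes.small_pool.current",
)


def snapshot(device):
    """Selected caching-allocator counters for one device, in MiB."""
    stats = torch.cuda.memory_stats(device)
    return {key: stats.get(key, 0) / MIB for key in KEYS}


def snapshot_all():
    """Counters for every visible GPU; empty dict if CUDA is unavailable."""
    if torch is None or not torch.cuda.is_available():
        return {}
    return {d: snapshot(d) for d in range(torch.cuda.device_count())}


def diff(before, after):
    """Change in each counter between two snapshots of the same device."""
    return {key: after[key] - before[key] for key in before}


# Illustrative numbers only, to show the shape of the comparison.
before = {"reserved_bytes.large_pool.current": 100.0}
after = {"reserved_bytes.large_pool.current": 346.0}
print(diff(before, after))
```

Taking snapshot_all() before and after a request and diffing per device shows which pool grew and on which GPU.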

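The increase originates from PyTorch’s caching allocator. When a tensor is freed, the allocator keeps the underlying block in its cache for reuse instead of returning it to the driver with cudaFree. Cached blocks go back to the driver only when torch.cuda.empty_cache() is called, when the allocator has to free cache to satisfy a request that would otherwise run out of memory, or when the process ends.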

PyTorch Memory Cache Structures

Key data structures:

Block : Basic unit (stream_id, size, ptr) representing a memory chunk.

BlockPool : Stores free blocks in a std::set ordered by (stream_id, size, address).

DeviceCachingAllocator : Maintains two pools to speed up allocations: small_blocks for requests of up to 1 MiB and large_blocks for everything larger.
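The structures above can be modeled in a few lines of Python. This is a simplified sketch, not PyTorch's C++ implementation: a sorted list stands in for the std::set, and the class and method names are illustrative. It shows why ordering by (stream, size, address) makes a best‑fit lookup a single lower‑bound search.

```python
import bisect
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(order=True)
class Block:
    # Field order doubles as the sort key: (stream, size, ptr).
    stream: int
    size: int
    ptr: int
    allocated: bool = field(default=False, compare=False)


class BlockPool:
    """Free blocks kept sorted, standing in for the C++ std::set."""

    def __init__(self):
        self.blocks: List[Block] = []

    def insert(self, block: Block) -> None:
        bisect.insort(self.blocks, block)

    def take_best_fit(self, stream: int, size: int) -> Optional[Block]:
        """Remove and return the smallest free block on `stream` with size >= `size`."""
        i = bisect.bisect_left(self.blocks, Block(stream, size, 0))
        if i < len(self.blocks) and self.blocks[i].stream == stream:
            return self.blocks.pop(i)
        return None  # cache miss: the real allocator would now call cudaMalloc


pool = BlockPool()
pool.insert(Block(stream=0, size=4096, ptr=0x1000))
pool.insert(Block(stream=0, size=1024, ptr=0x3000))
pool.insert(Block(stream=1, size=2048, ptr=0x5000))
hit = pool.take_best_fit(stream=0, size=2000)
miss = pool.take_best_fit(stream=0, size=8192)
print(hit, miss)
```

Because blocks are keyed by stream first, a free block cached for one stream is never handed to a request on another stream in this model.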

Allocation Process

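When a request arrives, the allocator first rounds the size and looks for a suitable free block in the matching pool. Only if none is found does alloc_block obtain a new segment from the driver via cudaMalloc. A block that is larger than the request may be split into the requested size and a remaining fragment, subject to thresholds such as max_split_size_mb.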

Block* malloc(c10::DeviceIndex device, size_t orig_size, cudaStream_t stream) { ... }
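The size handling in this allocation path can be sketched as follows. The constants reflect values found in PyTorch's CUDACachingAllocator source, but they may differ between versions, so treat them as assumptions and verify them against the version in use. The Python function names are illustrative.

```python
MIB = 1024 ** 2
MIN_BLOCK_SIZE = 512        # every request is rounded up to a multiple of this
SMALL_SIZE = 1 * MIB        # requests up to this size use the small pool
SMALL_BUFFER = 2 * MIB      # segment size requested for small allocations
LARGE_BUFFER = 20 * MIB     # segment size for requests between 1 MiB and 10 MiB
MIN_LARGE_ALLOC = 10 * MIB  # at or above this, round up to ROUND_LARGE
ROUND_LARGE = 2 * MIB


def round_size(size):
    """Round a request up to a multiple of the minimum block size."""
    if size < MIN_BLOCK_SIZE:
        return MIN_BLOCK_SIZE
    return MIN_BLOCK_SIZE * ((size + MIN_BLOCK_SIZE - 1) // MIN_BLOCK_SIZE)


def pool_for(size):
    """Which pool serves a (rounded) request."""
    return "small" if size <= SMALL_SIZE else "large"


def segment_size(size):
    """Bytes requested from cudaMalloc when no cached block fits."""
    if size <= SMALL_SIZE:
        return SMALL_BUFFER
    if size < MIN_LARGE_ALLOC:
        return LARGE_BUFFER
    return ROUND_LARGE * ((size + ROUND_LARGE - 1) // ROUND_LARGE)


for request in (100, 3 * MIB, 245 * MIB + 1):
    size = round_size(request)
    print(request, pool_for(size), segment_size(size) // MIB, "MiB segment")
```

Under these assumptions, large requests are rounded up to a multiple of 2 MiB, so the segment obtained from the driver can be slightly larger than the tensor that triggered it.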

Fragmentation and Merging

When a block is freed, try_merge_blocks checks neighboring blocks (prev/next) and merges them if both are free, reducing fragmentation.

size_t try_merge_blocks(Block* dst, Block* src, BlockPool& pool) { ... }
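The merge step can be illustrated with a small doubly linked list of blocks inside one segment. This is a simplified Python model, not the C++ code: it leaves out the pool bookkeeping that the real try_merge_blocks also performs, and the names are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Block:
    ptr: int
    size: int
    allocated: bool = False
    prev: Optional["Block"] = None
    next: Optional["Block"] = None


def try_merge(dst, src):
    """Absorb free neighbour `src` into `dst`; return bytes absorbed (0 if not merged)."""
    if src is None or src.allocated:
        return 0
    if dst.prev is src:      # src sits directly before dst
        dst.ptr = src.ptr
        dst.prev = src.prev
        if dst.prev:
            dst.prev.next = dst
    else:                    # src sits directly after dst
        dst.next = src.next
        if dst.next:
            dst.next.prev = dst
    dst.size += src.size
    return src.size


def free_block(block):
    """Mark a block free and merge it with any free neighbours."""
    block.allocated = False
    for neighbour in (block.prev, block.next):
        try_merge(block, neighbour)
    return block


# Three contiguous blocks in one segment; only the middle one is in use.
a = Block(ptr=0, size=100)
b = Block(ptr=100, size=50, allocated=True)
c = Block(ptr=150, size=25)
a.next, b.prev, b.next, c.prev = b, a, c, b
merged = free_block(b)
print(merged.ptr, merged.size)
```

Once every block of a segment has been merged back into one free block with no neighbours, that segment is a candidate for release to the driver.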

Cache Release

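Cached memory can be returned to the driver explicitly by calling torch.cuda.empty_cache(). The call triggers CUDACachingAllocator::emptyCache, which walks the large and small pools and calls cudaFree on cached blocks. Only a free block that spans a whole cudaMalloc segment can be released; a segment that still holds a live tensor stays reserved.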
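The difference between allocated and reserved memory is easy to observe on any CUDA machine. The sketch below is a generic demonstration, not the test from the investigation: it records both counters after allocating a tensor, after deleting it, and after torch.cuda.empty_cache(). The 256 MiB size and the function names are arbitrary choices, and the function returns None when PyTorch or a GPU is not available.

```python
try:
    import torch
except ImportError:  # lets the file load on machines without PyTorch
    torch = None

MIB = 1024 ** 2


def usage_mib():
    """Current allocator counters for the default device, in MiB."""
    return {
        "allocated": torch.cuda.memory_allocated() / MIB,
        "reserved": torch.cuda.memory_reserved() / MIB,
    }


def cache_demo(num_bytes=256 * MIB):
    """Record counters around del and empty_cache(); None without a GPU."""
    if torch is None or not torch.cuda.is_available():
        return None
    steps = {}
    x = torch.empty(num_bytes, dtype=torch.uint8, device="cuda")
    steps["after alloc"] = usage_mib()
    del x                      # tensor freed, its block returns to the cache
    steps["after del"] = usage_mib()
    torch.cuda.empty_cache()   # unused cached segments go back to the driver
    steps["after empty_cache"] = usage_mib()
    return steps


result = cache_demo()
if result is None:
    print("CUDA not available; nothing to measure")
else:
    for step, usage in result.items():
        print(step, usage)
```

On a GPU machine, the "after del" step is the one to look at: the allocated counter drops while the reserved counter, which is what nvidia‑smi reflects, stays up until empty_cache() runs.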

Conclusions and Recommendations

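Memory Growth Origin : The increase is due to PyTorch’s caching allocator reserving memory. Reserved memory already includes the memory of active tensors, so the value reported by nvidia‑smi is roughly the reserved memory plus allocations made outside the PyTorch allocator, such as the CUDA context and NCCL buffers.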

Prevent Fragmentation : Set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:<value> with a smaller value. Blocks larger than this size are not split, which reduces fragmentation.

Additional Optimizations : Call torch.cuda.empty_cache() at suitable points to return cached memory to the driver, and consider quantization (INT8/INT4) or offloading strategies (e.g., MoE to CPU) to lower overall GPU memory demand.
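PyTorch reads allocator options from the PYTORCH_CUDA_ALLOC_CONF environment variable as comma‑separated option:value pairs. A minimal sketch of setting it from Python follows; the helper name is hypothetical, and 128 is an illustrative value rather than a recommendation from this investigation.

```python
import os


def alloc_conf(**options):
    """Build a PYTORCH_CUDA_ALLOC_CONF value such as 'max_split_size_mb:128'."""
    return ",".join(f"{key}:{value}" for key, value in options.items())


# 128 is an illustrative value; tune it for the workload.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = alloc_conf(max_split_size_mb=128)
print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])

# Import torch (or vllm) only after this point, so the setting is in place
# before the process first touches CUDA.
```

For a server started from the shell, exporting the same variable before running vllm serve has the same effect.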

Thanks to @复东 and @倪祺 for assistance.
