Why Does GPU Memory Keep Growing in DeepSeek‑R1 Inference? Uncovering PyTorch’s Cache
After deploying the full-precision DeepSeek-R1 model on a 2×8-GPU ACS cluster, repeated stress tests showed GPU memory usage rising continuously without being released. This article walks through the investigation: it reproduces the behavior, examines vLLM logs and Prometheus metrics, identifies PyTorch's caching allocator as the root cause, and offers mitigation tips.
Background
A customer deployed the full‑precision DeepSeek‑R1 671B model on a distributed ACS setup (2×8 GPUs) using vLLM 0.7.2. After multiple stress‑test runs, GPU memory usage kept increasing and never decreased.
Key Questions
Who consumes the additional memory?
Does memory growth have an upper bound?
If it keeps growing, how can the memory be released?
Environment Details
Model: DeepSeek‑R1 671B
GPU: 2×8 GPU RDMA (ACS distributed deployment)
Inference framework: vLLM 0.7.2
Launch command:
vllm serve /data/DeepSeek-R1/ --port 8000 --trust-remote-code --served-model-name ds --max-model-len 8192 --gpu-memory-utilization 0.95 --tensor-parallel-size 16 --enforce-eager --api-key token_xxxx --max-num-seqs 128
Initial Investigation
vLLM logs show per‑GPU memory allocation:
INFO 03-20 19:40:50 worker.py:267] Memory profiling takes 16.56 seconds
INFO 03-20 19:40:50 worker.py:267] the current vLLM instance can use total_gpu_memory (9x.00 GiB) x gpu_memory_utilization (0.95) = 9x.00 GiB
INFO 03-20 19:40:50 worker.py:267] model weights take 42.59GiB; non_torch_memory takes 2.30GiB; PyTorch activation peak memory takes 1.42GiB; the rest of the memory reserved for KV Cache is 4x.00 GiB.
Each pod contains 8 GPUs, so the per-pod memory breakdown is:
Model weights: 42.59 GiB × 8 = 340.72 GiB
Non‑torch memory: 2.30 GiB × 8 = 18.4 GiB
Activation peak: 1.42 GiB × 8 = 11.36 GiB
KV Cache: 4x GiB × 8 ≈ 3xx GiB
Prometheus monitoring matches the pod‑level usage of ~7xx GiB.
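The per-pod figures come from multiplying each per-GPU value in the log by the 8 GPUs in a pod. A minimal sketch of that arithmetic follows. It leaves out the KV cache because the log shows its size only in redacted form, and the variable names are illustrative:

```python
# Per-GPU values copied from the vLLM worker log above (GiB).
GPUS_PER_POD = 8
per_gpu_gib = {
    "model weights": 42.59,
    "non-torch memory": 2.30,
    "activation peak": 1.42,
}

# Scale each component to pod level.
per_pod_gib = {name: round(gib * GPUS_PER_POD, 2) for name, gib in per_gpu_gib.items()}

# Everything except the KV cache, whose exact size is redacted in the log.
non_kv_total_gib = round(sum(per_pod_gib.values()), 2)

for name, gib in per_pod_gib.items():
    print(f"{name}: {gib:.2f} GiB per pod")
print(f"total excluding KV cache: {non_kv_total_gib:.2f} GiB")
```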
Reproducing the Issue
A curl request was sent to the service:
time curl http://172.17.98.24:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "ds",
"messages": [
{"role": "system", "content": "你是个友善的AI助手。"},
{"role": "user", "content": "介绍一下深度学习好吗。"}
],
"max_tokens": 2048
}'
Prometheus showed a 2 GiB increase that persisted for over 2 hours.
KV Cache Check
KV cache utilization stayed around 0.1, so the KV cache was not under pressure. Additional vLLM metrics confirmed low KV usage, while the PyTorch allocation trace showed the memory growth.
Prometheus Metric Validation
The metric DCGM_FI_DEV_FB_USED from Prometheus aligns with the Memory‑Usage field of nvidia‑smi, confirming the observed values are real.
sum(DCGM_FI_DEV_FB_USED{PodName=~"", NamespaceName=~""}) by (NamespaceName, PodName)
Metric name: DCGM_FI_DEV_FB_USED
Metric type: Gauge
Unit: MiB
Description: Frame buffer used; matches nvidia-smi Memory-Usage.
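One way to cross-check the Prometheus value by hand is to sum the per-GPU memory.used figures that nvidia-smi reports on a node. The sketch below parses the CSV output of `nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits`. The function names and the sample text are made up for illustration, and the sample numbers are not measurements from this deployment:

```python
import subprocess

def parse_memory_used(csv_text):
    """Map GPU index -> used memory in MiB from nvidia-smi CSV output."""
    used = {}
    for line in csv_text.strip().splitlines():
        index, mib = (field.strip() for field in line.split(","))
        used[int(index)] = int(mib)
    return used

def query_memory_used():
    """Run nvidia-smi on the local node (requires an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_memory_used(out)

# Hypothetical sample output for two GPUs (values in MiB).
sample = "0, 1000\n1, 1246\n"
per_gpu = parse_memory_used(sample)
print(per_gpu, "total MiB:", sum(per_gpu.values()))
```

Calling query_memory_used() on a GPU node and summing the result gives a number in the same unit (MiB) as DCGM_FI_DEV_FB_USED.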
Further Diagnosis with Nsight
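An Nsight capture of vLLM execution showed PyTorch calling cudaMalloc for 246 MiB on each GPU during the phase in which memory grew (Phase 2). Across 8 GPUs that is 246 MiB × 8 ≈ 1.9 GiB, which is consistent with the reproduced growth of about 2 GiB. NCCL initialization happens at model load, before this phase, so NCCL is not the culprit.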
Blocking Synchronization Test
Setting CUDA_LAUNCH_BLOCKING=1 and switching to a smaller Qwen‑2.5‑1.5B model reproduced the same memory growth pattern, confirming the issue is tied to PyTorch’s memory management rather than model size.
# Blocking sync (shell)
export CUDA_LAUNCH_BLOCKING=1
# Sample inference script (Python)
from vllm import LLM, SamplingParams

prompt = ["Hello, my name is", "The president of the United States is", "The capital of France is", "The future of AI is"]
model_name = "Qwen/Qwen2.5-1.5B"
llm = LLM(model=model_name, trust_remote_code=True, gpu_memory_utilization=0.95, max_model_len=8192)
ans = llm.generate(prompts=prompt)
# Print each prompt together with its first completion.
for output in ans:
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")
PyTorch Cache Mechanism Deep Dive
A loop added to model_runner.execute_model printed torch.cuda.memory_summary(device_id) for each GPU. The summaries showed that reserved memory in the "from large pool" row grew by about 1346 MiB on GPUs 0 and 6 after the test.
The increase originates from PyTorch’s caching allocator, which reserves memory blocks for future use and does not release them until the process ends or torch.cuda.empty_cache() is called.
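The gap between memory held by live tensors and memory held by the allocator can be observed directly with torch.cuda.memory_allocated and torch.cuda.memory_reserved. The following is a minimal sketch, not the instrumentation used in the investigation. The helper names and the 256 MiB size are arbitrary, and the code returns an empty result on machines without PyTorch or CUDA:

```python
try:
    import torch
except ImportError:  # lets the sketch load on machines without PyTorch
    torch = None

MIB = 1024 ** 2

def snapshot(device=0):
    """Return (allocated, reserved) in MiB, or None when CUDA is unavailable."""
    if torch is None or not torch.cuda.is_available():
        return None
    return (torch.cuda.memory_allocated(device) / MIB,
            torch.cuda.memory_reserved(device) / MIB)

def observe_cache(device=0, size_mib=256):
    """Allocate a tensor, drop it, and record memory after each step."""
    if snapshot(device) is None:
        return []
    steps = [("start", snapshot(device))]
    x = torch.empty(size_mib * MIB, dtype=torch.uint8, device=f"cuda:{device}")
    steps.append(("after allocation", snapshot(device)))
    del x  # the tensor is gone, but its block returns to the cache, not the driver
    steps.append(("after del", snapshot(device)))
    return steps

for label, stats in observe_cache():
    print(label, stats)
```

On a GPU machine, the allocated figure should drop after the del while the reserved figure stays up, which is the same pattern nvidia-smi and DCGM see from outside the process.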
PyTorch Memory Cache Structures
Key data structures:
Block : Basic unit (stream_id, size, ptr) representing a memory chunk.
BlockPool : Stores free blocks in a std::set ordered by (stream_id, size, address).
DeviceCachingAllocator : Maintains two pools (large_blocks, small_blocks) to speed up allocations.
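How a request is routed to one of the two pools can be sketched in a few lines. The constants below are the values I recall from PyTorch's CUDACachingAllocator.cpp (512-byte rounding, a 1 MiB small/large boundary, 2 MiB and 20 MiB segment sizes). Treat them as assumptions that may differ between PyTorch versions, and the Python function names as illustrative:

```python
MIB = 1024 * 1024

# Assumed constants, modelled on PyTorch's allocator source; verify per version.
MIN_BLOCK_SIZE = 512          # every request is rounded up to a multiple of this
SMALL_SIZE = 1 * MIB          # requests up to this size use the small pool
SMALL_BUFFER = 2 * MIB        # segment size requested for small allocations
LARGE_BUFFER = 20 * MIB       # segment size for requests between 1 and 10 MiB
MIN_LARGE_ALLOC = 10 * MIB    # at or above this, segments are sized to the request
ROUND_LARGE = 2 * MIB         # granularity for those large segments

def round_size(nbytes):
    """Round a request up to a multiple of MIN_BLOCK_SIZE."""
    if nbytes < MIN_BLOCK_SIZE:
        return MIN_BLOCK_SIZE
    return MIN_BLOCK_SIZE * ((nbytes + MIN_BLOCK_SIZE - 1) // MIN_BLOCK_SIZE)

def pool_for(size):
    """Pick the pool that serves a (rounded) request size."""
    return "small_blocks" if size <= SMALL_SIZE else "large_blocks"

def segment_size(size):
    """Bytes requested from cudaMalloc when no cached block fits."""
    if size <= SMALL_SIZE:
        return SMALL_BUFFER
    if size < MIN_LARGE_ALLOC:
        return LARGE_BUFFER
    return ROUND_LARGE * ((size + ROUND_LARGE - 1) // ROUND_LARGE)

for request in (1000, 5 * MIB, 245 * MIB):
    size = round_size(request)
    print(request, pool_for(size), segment_size(size) // MIB, "MiB segment")
```

Under these assumptions, a request of a few hundred MiB lands in the large pool and results in a segment of roughly its own size, which fits the "from large pool" growth described above.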
Allocation Process
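When a request arrives, the allocator first looks for a suitable free block in the matching pool; only if none is found does alloc_block obtain new memory via cudaMalloc. The allocator may then split a larger block into the requested size and a remaining fragment, subject to thresholds (e.g., max_split_size_mb).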
Block* malloc(c10::DeviceIndex device, size_t orig_size, cudaStream_t stream) { ... }
Fragmentation and Merging
When a block is freed, try_merge_blocks checks its neighboring blocks (prev/next) and merges the freed block with each neighbor that is also free, reducing fragmentation.
size_t try_merge_blocks(Block* dst, Block* src, BlockPool& pool) { ... }
Cache Release
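Cached memory is returned to the driver only when torch.cuda.empty_cache() is invoked. The call triggers CUDACachingAllocator::emptyCache, which ultimately frees the unused cached blocks in the large and small pools; blocks that still back live tensors are not affected.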
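The allocate, split, merge, and release steps can be imitated with a toy model in plain Python. This is a simplified sketch and not PyTorch's implementation: it has a single pool, no streams, always splits a larger block, and uses integer offsets instead of device pointers. All class and method names are hypothetical:

```python
from dataclasses import dataclass

@dataclass(eq=False)  # compare blocks by identity, like pointers
class Block:
    segment: int      # id of the simulated cudaMalloc this block came from
    addr: int         # offset of the block inside its segment
    size: int
    free: bool = True

class ToyCachingAllocator:
    def __init__(self):
        self.blocks = []          # every block, free or in use
        self.segment_sizes = {}   # segment id -> bytes obtained from the "driver"
        self._next_segment = 0

    @property
    def reserved(self):
        return sum(self.segment_sizes.values())

    @property
    def allocated(self):
        return sum(b.size for b in self.blocks if not b.free)

    def malloc(self, size):
        # Best fit: the smallest cached free block that is large enough.
        fits = [b for b in self.blocks if b.free and b.size >= size]
        if fits:
            block = min(fits, key=lambda b: (b.size, b.addr))
        else:
            # Cache miss: simulate cudaMalloc by creating a new segment.
            block = Block(self._next_segment, 0, size)
            self.segment_sizes[block.segment] = size
            self._next_segment += 1
            self.blocks.append(block)
        if block.size > size:
            # Split: keep the remainder as a free block for later requests.
            self.blocks.append(Block(block.segment, block.addr + size, block.size - size))
            block.size = size
        block.free = False
        return block

    def free(self, block):
        # The block goes back to the cache; nothing is returned to the driver.
        block.free = True
        for other in list(self.blocks):
            if other is block or not other.free or other.segment != block.segment:
                continue
            if other.addr + other.size == block.addr:      # free left neighbour
                block.addr = other.addr
            elif block.addr + block.size != other.addr:    # not a right neighbour
                continue
            block.size += other.size
            self.blocks.remove(other)

    def empty_cache(self):
        # Only segments that are entirely free can be handed back.
        for b in list(self.blocks):
            if b.free and b.addr == 0 and b.size == self.segment_sizes[b.segment]:
                self.blocks.remove(b)
                del self.segment_sizes[b.segment]

alloc = ToyCachingAllocator()
a, b = alloc.malloc(100), alloc.malloc(50)
alloc.free(a)
alloc.free(b)
print(alloc.allocated, alloc.reserved)  # nothing live, yet memory stays reserved
c = alloc.malloc(60)                    # served from the cached 100-byte block
alloc.empty_cache()                     # only the fully free segment is released
print(alloc.allocated, alloc.reserved)
```

The first print corresponds to what was observed in production: no live tensors, but the reserved figure, which external tools report, does not drop until empty_cache is called.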
Conclusions and Recommendations
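Memory Growth Origin : The increase comes from PyTorch's caching allocator reserving memory. Reserved memory already includes the memory of active tensors; the value reported by nvidia-smi is this reserved memory plus memory allocated outside PyTorch's allocator (the non_torch_memory in the vLLM log, such as the CUDA context).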
Prevent Fragmentation : Set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:<value> with a smaller value to reduce fragmentation.
Additional Optimizations : Call torch.cuda.empty_cache() to return unused cached memory to the driver, and consider quantization (INT8/INT4) or offloading strategies (e.g., MoE to CPU) to lower overall GPU memory demand.
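The first two mitigations could be wired into a script roughly as follows. This is a hedged sketch: the value 128 is illustrative and not a tuned recommendation, release_cached_memory is a hypothetical helper, and the code does nothing on a machine without PyTorch or CUDA:

```python
import gc
import os

# The allocator reads this setting when it initializes, so set it before the
# first CUDA allocation. 128 is an illustrative value only.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

try:
    import torch
except ImportError:  # lets the sketch load on machines without PyTorch
    torch = None

def release_cached_memory(device=0):
    """Return how many reserved bytes were handed back (0 without CUDA)."""
    if torch is None or not torch.cuda.is_available():
        return 0
    before = torch.cuda.memory_reserved(device)
    gc.collect()               # drop unreachable tensors so their blocks become free
    torch.cuda.empty_cache()   # return unused cached blocks to the driver
    return before - torch.cuda.memory_reserved(device)

print("released bytes:", release_cached_memory())
```

torch.cuda.empty_cache() only affects the process that calls it, so in a multi-process server the call has to run inside each worker process, and the environment variable has to be present in the environment of the serving processes themselves.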
Thanks to @复东 and @倪祺 for assistance.
Alibaba Cloud Developer
