Deploying DeepSeek‑V4‑Flash Locally on 2 × NVIDIA H20 (96 GB) – Quick Performance Test
This article walks through deploying DeepSeek‑V4‑Flash on a server with two NVIDIA H20 GPUs (96 GB each): model download, Docker image preparation, launch‑script tweaks, and the memory savings from FP8 and expert parallelism. It then reports the observed concurrency limit and tokens‑per‑second figures, including a run with the model's thinking mode disabled.
The author uses a workstation equipped with an Intel Xeon Platinum 8457C CPU, 480 GiB RAM, and two NVIDIA H20 GPUs (each 96 GB VRAM), driver 580.126.09, CUDA 13.0, a 100 GB system disk and a 1 TB data disk.
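Before downloading anything, it is worth confirming that the driver and both GPUs are visible (a generic check with standard NVIDIA tooling, not from the original write‑up):
nvidia-smi --query-gpu=index,name,memory.total,driver_version --format=csv
# expect two H20 rows, ~96 GB each, driver 580.126.09
free -h   # should show roughly 480 GiB of system RAM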
Model download
The 160 GB DeepSeek‑V4‑Flash model is fetched via ModelScope:
modelscope download --model deepseek-ai/DeepSeek-V4-Flash --local_dir /data/models/DeepSeek-V4-Flash
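A quick way to confirm the download completed is to check that the directory size roughly matches the 160 GB quoted above (a generic check, not part of the original article):
du -sh /data/models/DeepSeek-V4-Flash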
vLLM Docker image
Because installing vllm-nightly directly often fails, the author pulls the pre‑built Docker image:
docker pull vllm/vllm-openai:deepseekv4-cu129
Launch script
After several trial‑and‑error attempts that resulted in OOM errors, the following docker run command works on the 2 × H20 setup:
docker run -d \
--name vllm-deepseek-v4-flash \
--restart unless-stopped \
--gpus all \
--privileged \
--ipc=host \
-p 8000:8000 \
-v /data/models:/models:ro \
-e VLLM_ENGINE_READY_TIMEOUT_S=3600 \
vllm/vllm-openai:deepseekv4-cu129 \
/models/DeepSeek-V4-Flash \
--trust-remote-code \
--kv-cache-dtype fp8 \
--block-size 256 \
--enable-expert-parallel \
--data-parallel-size 2 \
--gpu-memory-utilization 0.95 \
--max-model-len 7000 \
--tokenizer-mode deepseek_v4 \
--tool-call-parser deepseek_v4 \
--enable-auto-tool-choice \
--enforce-eager
The default maximum sequence length of the model is 1,048,576 tokens, which is infeasible on this hardware; the author therefore limits --max-model-len to 7 K.
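Once the engine reports ready (initialization timing is discussed below), a minimal smoke test against vLLM's OpenAI‑compatible endpoints looks like this; the model name is assumed to match the served path:
curl -s http://localhost:8000/v1/models   # should list /models/DeepSeek-V4-Flash
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "/models/DeepSeek-V4-Flash",
       "messages": [{"role": "user", "content": "Say hello."}],
       "max_tokens": 32}'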
Runtime observations
The original Safetensors weights total 148.66 GiB; with FP8 quantization and expert parallelism (EP), each worker loads only 77.6 GiB.
After accounting for weights and system reservations, about 9.29 GiB remains for cache.
vLLM reports a maximum concurrency of 3.72× for 7,000‑token requests, i.e. roughly 3.7 full‑length requests can be served at once.
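That 3.72× figure is vLLM's own estimate: free KV‑cache memory divided by the KV footprint of one max‑length sequence. Inverting the logged numbers gives a rough per‑request cost (a back‑of‑envelope sketch, not from the logs):
awk 'BEGIN {
  cache_gib = 9.29; concurrency = 3.72; max_len = 7000
  per_req = cache_gib / concurrency   # GiB per 7000-token request
  printf "KV per request: %.2f GiB (%.0f KiB per token)\n", per_req, per_req * 1024 * 1024 / max_len
}'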
The model contains 256 experts; with data‑parallel size 2, each worker maintains 128 experts, spreading memory pressure across GPUs.
Logs show the use of DeepSeek’s fp8_ds_mla KV‑cache format, which applies low‑rank compression (Multi‑head Latent Attention) to reduce memory bandwidth usage.
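For context on why this matters: with Multi‑head Latent Attention the per‑token cache stores a small latent vector instead of full per‑head K/V. Using DeepSeek‑V3‑style dimensions (an assumption; the article does not give V4‑Flash's dims):
awk 'BEGIN {
  heads = 128; head_dim = 128        # assumed V3-style attention dims
  latent = 512 + 64                  # kv_lora_rank + rope dim (V3 values)
  full = heads * head_dim * 2        # standard K+V values per token per layer
  printf "full KV: %d values, MLA latent: %d values (%.0fx smaller)\n", full, latent, full / latent
}'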
TileLang successfully compiled kernels such as mhc_pre_big_fuse_tilelang.
Engine initialization (profile, cache creation, warm‑up) takes about 233 seconds, dominated by a 2 min 36 s DeepGEMM warm‑up.
Performance snapshot
With the above configuration, the average generation speed is 8.33 tokens/s, which the author notes is surprisingly high for H20 GPUs.
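Tokens per second here is the usual completion‑tokens over wall‑clock measure; a minimal way to reproduce it (hypothetical prompt, jq assumed installed):
start=$(date +%s.%N)
resp=$(curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "/models/DeepSeek-V4-Flash",
       "messages": [{"role": "user", "content": "Explain expert parallelism briefly."}],
       "max_tokens": 1024}')
end=$(date +%s.%N)
echo "$(echo "$resp" | jq '.usage.completion_tokens') $start $end" |
  awk '{printf "%.2f tokens/s\n", $1 / ($3 - $2)}'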
Disabling “thinking”
By turning off the model's "thinking" feature via the API, the author reran ten identical prompts (max_tokens = 1024) and observed a clear speed‑up. After further script tweaks, switching from data parallelism to tensor parallelism and dropping --enforce-eager, throughput reached about 20 tokens/s; with thinking disabled it climbed above 70 tokens/s.
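The article does not show the exact API call; in vLLM the usual per‑request mechanism for toggling a model's reasoning mode is chat_template_kwargs, so a sketch might look like this (the enable_thinking key is an assumption for this model):
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "/models/DeepSeek-V4-Flash",
       "messages": [{"role": "user", "content": "Summarize FP8 KV cache in one sentence."}],
       "max_tokens": 1024,
       "chat_template_kwargs": {"enable_thinking": false}}'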
The author plans to continue tuning for higher performance.