Deploying DeepSeek‑V4‑Flash Locally on 2 × NVIDIA H20 (96 GB) – Quick Performance Test

This article walks through deploying DeepSeek‑V4‑Flash on a server with two NVIDIA H20 GPUs (96 GB each): downloading the model, preparing the Docker image, tuning the launch script, and compressing memory via FP8 quantization and expert parallelism. It then reports the observed concurrency limits and tokens‑per‑second throughput, including a test with the model's thinking mode disabled.

Old Zhang's AI Learning

The author's server pairs an Intel Xeon Platinum 8457C CPU and 480 GiB of RAM with two NVIDIA H20 GPUs (96 GB VRAM each), running driver 580.126.09 and CUDA 13.0, with a 100 GB system disk and a 1 TB data disk.

Model download

The 160 GB DeepSeek‑V4‑Flash model is fetched via ModelScope:

modelscope download --model deepseek-ai/DeepSeek-V4-Flash --local_dir /data/models/DeepSeek-V4-Flash

vLLM Docker image

Because installing vllm-nightly directly often fails, the author pulls the pre‑built Docker image:

docker pull vllm/vllm-openai:deepseekv4-cu129

Launch script

After several trial‑and‑error attempts that resulted in OOM errors, the following Docker run command works on the 2 × H20 setup:

docker run -d \
  --name vllm-deepseek-v4-flash \
  --restart unless-stopped \
  --gpus all \
  --privileged \
  --ipc=host \
  -p 8000:8000 \
  -v /data/models:/models:ro \
  -e VLLM_ENGINE_READY_TIMEOUT_S=3600 \
  vllm/vllm-openai:deepseekv4-cu129 \
  /models/DeepSeek-V4-Flash \
  --trust-remote-code \
  --kv-cache-dtype fp8 \
  --block-size 256 \
  --enable-expert-parallel \
  --data-parallel-size 2 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 7000 \
  --tokenizer-mode deepseek_v4 \
  --tool-call-parser deepseek_v4 \
  --enable-auto-tool-choice \
  --enforce-eager
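Engine start‑up takes minutes (the container above raises VLLM_ENGINE_READY_TIMEOUT_S accordingly), so it helps to poll the server before sending traffic. A minimal readiness‑poll sketch, assuming the standard OpenAI‑compatible /v1/models endpoint that vLLM serves; the helper and its parameters are illustrative, not part of the article:

```python
import time
import urllib.error
import urllib.request

def wait_until_ready(url: str = "http://localhost:8000/v1/models",
                     timeout_s: float = 600.0,
                     interval_s: float = 5.0,
                     probe=None) -> bool:
    """Poll `url` until it answers, or give up after timeout_s seconds.

    `probe` is injectable for testing; by default it issues a real HTTP GET
    and returns True on any 2xx response.
    """
    if probe is None:
        def probe(u):
            try:
                with urllib.request.urlopen(u, timeout=5) as resp:
                    return 200 <= resp.status < 300
            except (urllib.error.URLError, OSError):
                return False
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe(url):
            return True
        time.sleep(interval_s)
    return False
```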

The model's default maximum sequence length is 1,048,576 tokens, which is infeasible on this hardware, so the author caps --max-model-len at 7,000.

Runtime observations

The original Safetensors weight file is 148.66 GiB; after FP8 quantization and Expert Parallelism (EP) each worker loads only 77.6 GiB.
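The two figures reconcile with simple arithmetic if one assumes (the article does not say this explicitly) that expert weights are sharded evenly across the two workers while non‑expert weights are replicated on both:

```python
# Back-of-envelope split of the checkpoint across 2 expert-parallel workers.
# Assumption (not stated in the article): expert weights are sharded evenly,
# non-expert (attention/shared) weights are replicated on every worker.
total_gib = 148.66       # full checkpoint, as reported
per_worker_gib = 77.6    # per-worker load, as reported
ep_workers = 2

# If each worker holds (expert weights / 2) + all shared weights, then
#   per_worker = shared + (total - shared) / 2
# Solving for the replicated (shared) portion:
shared_gib = ep_workers * per_worker_gib - total_gib
expert_gib_per_worker = (total_gib - shared_gib) / ep_workers

print(f"replicated non-expert weights ~ {shared_gib:.2f} GiB")
print(f"expert shard per worker       ~ {expert_gib_per_worker:.2f} GiB")
```

With the reported numbers this implies roughly 6.5 GiB of replicated weights plus a 71 GiB expert shard per GPU.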

After accounting for weights and system reservations, about 9.29 GiB remains for cache.

vLLM reports a maximum concurrency of 3.72× for 7,000‑token requests, i.e., roughly three to four full‑length requests fit in the KV cache at once.
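The concurrency figure follows from the reported cache size: concurrency = cache bytes / (max sequence length × KV bytes per token). Running that relation backwards gives the per‑token KV footprint implied by the log (an inference from the reported numbers, not a value the article states):

```python
# Reproduce vLLM's "maximum concurrency" figure from the reported numbers.
free_cache_gib = 9.29    # KV-cache memory left after weights, as reported
max_model_len = 7000     # --max-model-len
max_concurrency = 3.72   # as reported

# concurrency = cache_bytes / (max_model_len * kv_bytes_per_token),
# so the implied per-token KV footprint is:
kv_bytes_per_token = free_cache_gib * 2**30 / (max_model_len * max_concurrency)
print(f"implied KV footprint ~ {kv_bytes_per_token / 1024:.0f} KiB per token")
```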

The model contains 256 experts; with data‑parallel size 2, each worker maintains 128 experts, spreading memory pressure across GPUs.
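The 256‑to‑128 split can be pictured as a simple block partition of expert IDs across the two ranks. This is an illustrative sketch only; vLLM's actual expert‑placement logic is more involved:

```python
# Minimal sketch of partitioning 256 experts across 2 expert-parallel ranks.
NUM_EXPERTS = 256
EP_SIZE = 2

def owner_rank(expert_id: int) -> int:
    """Contiguous block partition: experts 0-127 -> rank 0, 128-255 -> rank 1."""
    return expert_id * EP_SIZE // NUM_EXPERTS

shards = {r: [e for e in range(NUM_EXPERTS) if owner_rank(e) == r]
          for r in range(EP_SIZE)}
print({r: len(s) for r, s in shards.items()})  # {0: 128, 1: 128}
```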

Logs show the use of DeepSeek’s fp8_ds_mla KV‑cache format, which applies low‑rank compression (Multi‑head Latent Attention) to reduce memory bandwidth usage.
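The reason an MLA‑style cache is so compact: instead of storing full per‑head K and V vectors, it caches one low‑rank latent vector plus a small RoPE key per token per layer. The dimensions below are the published DeepSeek‑V3 values, assumed here purely for illustration; the article gives no dimensions for V4‑Flash:

```python
# Per-token, per-layer KV-cache footprint: full multi-head KV vs MLA latent.
# Dimensions are DeepSeek-V3's published values, used as an assumption here.
n_heads, head_dim = 128, 128
kv_lora_rank, rope_dim = 512, 64
bytes_per_elem = 1  # fp8

mha_per_token_layer = 2 * n_heads * head_dim * bytes_per_elem     # full K + V
mla_per_token_layer = (kv_lora_rank + rope_dim) * bytes_per_elem  # latent + RoPE key

print(f"full KV: {mha_per_token_layer} B/token/layer")
print(f"MLA KV:  {mla_per_token_layer} B/token/layer")
print(f"compression ~ {mha_per_token_layer / mla_per_token_layer:.1f}x")
```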

TileLang successfully compiled kernels such as mhc_pre_big_fuse_tilelang.

Engine initialization (profile, cache creation, warm‑up) takes about 233 seconds, dominated by a 2 min 36 s DeepGEMM warm‑up.

Performance snapshot

With the above configuration the average generation speed is 8.33 tokens / s. The author notes that this is surprisingly high for the H20 GPUs.
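Throughput figures like this can be reproduced with a stopwatch over the streamed tokens. A minimal sketch; in practice `token_stream` would be the content deltas from a streaming chat‑completions response, which the article does not show:

```python
import time
from typing import Iterable

def measure_tps(token_stream: Iterable[str]) -> float:
    """Count streamed tokens and divide by wall-clock time."""
    start = time.monotonic()
    n_tokens = sum(1 for _ in token_stream)
    elapsed = time.monotonic() - start
    return n_tokens / elapsed if elapsed > 0 else float("inf")
```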

Disabling “thinking”

With the model's "thinking" feature turned off via the API, the author reran ten identical prompts (max_tokens = 1024) and observed a clear speed‑up. After further script tweaks (switching from data parallelism to tensor parallelism and removing --enforce-eager), throughput reached ~20 tokens/s with thinking on, and climbed above 70 tokens/s with thinking disabled.
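A sketch of a request with thinking switched off. The exact switch is an assumption: vLLM forwards `chat_template_kwargs` to the chat template, and several reasoning models accept an `enable_thinking`‑style flag there, but the article does not show the field DeepSeek‑V4‑Flash expects, so check the model card:

```python
# Build an OpenAI-style chat-completions payload with "thinking" disabled.
# The `chat_template_kwargs`/`enable_thinking` field is an assumption, not
# taken from the article -- verify against the model card.
def build_request(prompt: str, thinking: bool) -> dict:
    payload = {
        "model": "/models/DeepSeek-V4-Flash",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 1024,
    }
    if not thinking:
        payload["chat_template_kwargs"] = {"enable_thinking": False}
    return payload

# POST the dict as JSON to http://localhost:8000/v1/chat/completions, e.g.:
#   requests.post("http://localhost:8000/v1/chat/completions",
#                 json=build_request("hello", thinking=False))
```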

The author plans to continue tuning for higher performance.

Tags: Docker, GPU deployment, FP8 quantization, LLM performance, DeepSeek V4
Written by Old Zhang's AI Learning

AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
