
DeepSeek‑V4 Local Deployment: How SGLang Overcomes the Architecture Challenges

The article analyzes DeepSeek‑V4's architectural innovations—including mixed sparse attention, mHC, and native FP4 weights—explains SGLang's ShadowRadix, HiSparse, and in‑graph speculative decoding solutions, presents benchmark gains, provides Docker deployment steps, and warns of key pitfalls for long‑context inference.

Old Zhang's AI Learning

May 1, 2026


What changed in V4

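Background: DeepSeek released two variants, V4‑Flash (284B total parameters, 13B active) and V4‑Pro (1.6T total, 49B active). Both ship as mixed‑precision checkpoints with FP4 MoE expert weights and FP8 attention and dense layers, so a single set of weights serves Hopper, Blackwell, AMD, and NPU targets (Hopper has no native FP4 tensor cores and relies on the fallback paths listed under Gotchas). Both are released under the MIT license, support a 1M‑token context, and were pre‑trained on more than 32T tokens.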

The architectural innovations consist of three parts:

Mixed sparse attention (CSA + HCA): each layer combines sliding‑window attention (SWA, 128‑token window) with either C4 (4:1 compression plus top‑512 sparse selection) or C128 (128:1 compression with dense attention over the compressed entries). In 1M‑context scenarios, V4‑Pro reduces per‑token inference FLOPs to 27 % of V3.2's and the KV cache to 10 %.
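To make the two layer types concrete, the following back-of-the-envelope sketch counts how many KV entries one query touches under each scheme. It assumes that "1M" means 2**20 tokens and that the SWA window and the compressed branch simply add up; the function names are illustrative and are not SGLang or DeepSeek APIs.

```python
import math

SWA_WINDOW = 128    # sliding-window size quoted in the article
C4_RATIO = 4        # 4:1 compression
C4_TOP_K = 512      # sparse selection over compressed entries
C128_RATIO = 128    # 128:1 compression, attended densely


def c4_entries(context_len):
    """Return (compressed entries stored, entries attended per query) for a C4 layer."""
    stored = math.ceil(context_len / C4_RATIO)
    attended = min(SWA_WINDOW, context_len) + min(C4_TOP_K, stored)
    return stored, attended


def c128_entries(context_len):
    """Return (compressed entries stored, entries attended per query) for a C128 layer."""
    stored = math.ceil(context_len / C128_RATIO)
    attended = min(SWA_WINDOW, context_len) + stored
    return stored, attended


context = 2 ** 20  # assumption: 1M context = 1,048,576 tokens
print(c4_entries(context))    # (262144, 640)
print(c128_entries(context))  # (8192, 8320)
```

Under these assumptions a C4 layer still has to rank roughly 256K compressed candidates per query before keeping 512 of them, which is why the selection step gets its own kernel.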

mHC (Manifold‑Constrained Hyper‑Connection): replaces traditional residual connections with a set of parallel branches weighted by a Sinkhorn‑normalized mixture, improving gradient flow and representation quality.
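Sinkhorn normalization itself is a standard procedure: alternately rescale the rows and columns of a positive matrix until both sum to one. The sketch below shows it in plain Python on a small mixing matrix. How V4 parameterizes and applies this matrix inside mHC is not described here, so the matrix size, the exponentiation step, and the iteration count are all assumptions.

```python
import math


def sinkhorn(logits, iterations=50):
    """Turn a square matrix of logits into an approximately doubly stochastic matrix."""
    n = len(logits)
    # Exponentiate so every entry is strictly positive.
    m = [[math.exp(x) for x in row] for row in logits]
    for _ in range(iterations):
        # Rescale each row to sum to 1.
        m = [[x / sum(row) for x in row] for row in m]
        # Rescale each column to sum to 1.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m


# Hypothetical 3-branch mixing logits.
mix = sinkhorn([[0.5, -1.0, 0.2], [1.5, 0.3, -0.7], [0.0, 0.8, 1.1]])
for row in mix:
    print([round(x, 3) for x in row])
```

Because every row and column of the result sums to one, the mixture neither amplifies nor attenuates the total signal across branches, which is one plausible reading of the "manifold constraint".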

Native FP4 expert weights: directly leverage Blackwell’s FP4 tensor cores and relieve the memory‑bandwidth bottleneck of small‑batch decode.

Additionally, V4 introduces a single‑layer MTP head for speculative decoding and offers three reasoning modes: Non‑think (intuition), Think High (chain‑of‑thought), and Think Max (max‑depth, recommended for ≥384K context).

What SGLang did

The mixed attention creates a “three‑set heterogeneous KV pool + two compression state pools” problem, breaking the traditional prefix‑cache assumption. SGLang addresses this with several tightly integrated components.

ShadowRadix: builds a radix tree that indexes “virtual full‑token slots” and projects them onto the physical pools (SWA / C4 / C128). The address formula is swa_page * ring_size + pos % ring_size. Each node holds two counters (full_lock_ref and swa_lock_ref) to manage the source and shadow pools, enabling reuse of compressed KV without extra tracking cost.
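The address formula quoted above is ordinary ring-buffer indexing. A minimal sketch follows, with a ring size of 128 chosen here only because it matches the SWA window; the function name and the example values are hypothetical.

```python
def swa_physical_slot(swa_page, ring_size, pos):
    """Map a virtual token position to a physical slot in the SWA pool.

    Implements the quoted formula: swa_page * ring_size + pos % ring_size.
    """
    if ring_size <= 0:
        raise ValueError("ring_size must be positive")
    return swa_page * ring_size + pos % ring_size


# Positions that differ by a multiple of ring_size land on the same slot,
# so a page only ever holds the most recent ring_size tokens.
print(swa_physical_slot(swa_page=3, ring_size=128, pos=2))    # 386
print(swa_physical_slot(swa_page=3, ring_size=128, pos=130))  # 386
```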

HiSparse: moves the majority of inactive KV (especially C4) to CPU memory. A fixed CPU mirror holds the KV pool while the GPU retains a small active working set; a coordinator asynchronously swaps pages using LRU eviction.

Effect: on a 2×B200 V4‑Flash setup serving long‑context requests with 200K input tokens and 20K output tokens, peak throughput increases by up to 3×.

Figure: HiSparse architecture and peak throughput.
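The LRU policy behind the GPU working set can be sketched in a few lines. This toy version is synchronous and keeps pages in a Python dict, whereas HiSparse swaps asynchronously between real device and host buffers. It also assumes that evicted pages need no write-back because the CPU mirror is always complete; the class and field names are hypothetical.

```python
from collections import OrderedDict


class GpuWorkingSet:
    """Toy LRU cache of KV pages resident on the GPU, backed by a full CPU mirror."""

    def __init__(self, capacity, cpu_mirror):
        self.capacity = capacity
        self.cpu_mirror = cpu_mirror    # page_id -> page data, holds every page
        self.resident = OrderedDict()   # page_id -> page data, least recent first
        self.swap_ins = 0

    def get(self, page_id):
        if page_id in self.resident:
            self.resident.move_to_end(page_id)  # mark as most recently used
            return self.resident[page_id]
        # Miss: drop the least recently used page if full, then copy in from CPU.
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)
        self.resident[page_id] = self.cpu_mirror[page_id]
        self.swap_ins += 1
        return self.resident[page_id]


mirror = {i: f"kv-page-{i}" for i in range(8)}
gpu = GpuWorkingSet(capacity=2, cpu_mirror=mirror)
for page in (0, 1, 0, 2, 1):
    gpu.get(page)
print(list(gpu.resident), gpu.swap_ins)  # [2, 1] 4
```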

MTP speculative decoding + in‑graph metadata: per‑pass metadata (SWA page index, shadow mapping, compression plans, pool write positions) is baked into a CUDA graph. Replay copies only the batch state; all index arithmetic runs in device kernels, eliminating Python‑side scheduling overhead.

Result: decoding throughput stays nearly flat from 4K to 900K context, dropping by less than 10 % in tokens/s on both B200 (199→180) and H200 (266→240), a “flatness” previously unseen in long‑context inference.
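The "less than 10 %" figure can be checked directly from the quoted token rates. The helper name below is illustrative only.

```python
def relative_drop(short_ctx_tps, long_ctx_tps):
    """Fractional throughput loss going from the short to the long context."""
    return (short_ctx_tps - long_ctx_tps) / short_ctx_tps


# (tokens/s at 4K, tokens/s at 900K) as quoted in the article.
reported = {"B200": (199, 180), "H200": (266, 240)}
for gpu_name, (tps_4k, tps_900k) in reported.items():
    print(f"{gpu_name}: {relative_drop(tps_4k, tps_900k):.1%} drop")
# B200: 9.5% drop
# H200: 9.8% drop
```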

Kernel innovations

FlashMLA new interface: combines SWA and the extra attention branch (C4/C128) in a single kernel, sharing metadata within the forward pass.

Flash Compressor: compresses five HBM passes of sparse attention into one on‑chip pass (HBM 5→2), achieving 80 % of peak bandwidth and a >10× speed‑up over naive PyTorch pipelines.

Lightning TopK: replaces a full sort of 256K candidates with an 8‑way radix‑select, reducing the per‑batch cost of top‑512 selection at 1M context from >100 µs to ~15 µs.
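A radix-select avoids a full sort by histogramming one digit of the key at a time and descending only into the bucket that contains the k-th element. The sketch below is a CPU-side Python illustration, not the GPU kernel. It assumes that "8-way" means 8 buckets (3 bits) per pass and that scores have been quantized to non-negative integers; neither detail is confirmed by the source.

```python
def radix_select_topk(scores, k, key_bits=12, digit_bits=3):
    """Return indices of the k largest scores (ties broken arbitrarily).

    scores must be non-negative integers below 2**key_bits.
    """
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and len(scores)")
    if key_bits % digit_bits != 0:
        raise ValueError("key_bits must be a multiple of digit_bits")
    radix = 1 << digit_bits           # 8 buckets when digit_bits == 3
    selected = []                     # indices already known to be in the top k
    candidates = list(range(len(scores)))
    need = k
    # Walk digits from most significant to least significant.
    for shift in range(key_bits - digit_bits, -1, -digit_bits):
        buckets = [[] for _ in range(radix)]
        for i in candidates:
            buckets[(scores[i] >> shift) & (radix - 1)].append(i)
        for digit in range(radix - 1, -1, -1):
            bucket = buckets[digit]
            if len(bucket) <= need:
                # The whole bucket belongs to the top k.
                selected.extend(bucket)
                need -= len(bucket)
                if need == 0:
                    return selected
            else:
                # The k-th element lies in this bucket; refine it on the next digit.
                candidates = bucket
                break
    # All digits consumed: remaining candidates share the same key.
    selected.extend(candidates[:need])
    return selected


scores = [5, 1, 9, 3, 9, 7]
top = radix_select_topk(scores, k=3, key_bits=6)
print(sorted(scores[i] for i in top))  # [7, 9, 9]
```

Each pass touches only the surviving candidates, so the work shrinks quickly after the first histogram, whereas a full sort keeps ordering elements that can never reach the top k.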

FlashInfer TRTLLM‑Gen MoE: uses MXFP8 activations with MXFP4 expert weights to exploit Blackwell FP4 tensor cores.

DeepGEMM Mega MoE: fuses EP dispatch, the first FP8×FP4 GEMM, SwiGLU, the second GEMM, and EP combine into one mega kernel, overlapping NVLink communication with tensor‑core compute.

TileLang mHC kernels (with split‑K): mitigate low‑latency decode bottlenecks by splitting the K‑dimension.

DP/TP/CP attention, DeepEP EP MoE, PD disaggregated deployment: together provide a full suite of parallelism strategies.

Deployment

SGLang provides Docker images for each hardware platform:

NVIDIA B300 – lmsysorg/sglang:deepseek-v4-b300
NVIDIA B200 – lmsysorg/sglang:deepseek-v4-blackwell
NVIDIA GB200/GB300 – lmsysorg/sglang:deepseek-v4-grace-blackwell
NVIDIA H200 – lmsysorg/sglang:deepseek-v4-hopper

Minimal launch command (example for Blackwell):

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<your-hf-token>" \
    --ipc=host \
    lmsysorg/sglang:deepseek-v4-blackwell \
    sglang serve <use args below>

Standard OpenAI‑compatible API call:

curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Flash",
    "messages": [{"role": "user", "content": "What is 15% of 240?"}]
}'
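The same request can be issued from Python using only the standard library. This is a sketch that assumes the server from the Docker step is listening on localhost:30000 and returns the usual OpenAI-style response body; the helper names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(prompt,
                       base_url="http://localhost:30000",
                       model="deepseek-ai/DeepSeek-V4-Flash"):
    """Build (but do not send) a POST to the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the request and return the reply text; needs a running server."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling ask("What is 15% of 240?") against a running server mirrors the curl command above.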

Three preset configurations:

low‑latency: MTP steps=3, draft‑tokens=4; best for batch size 1.
balanced: MTP steps=1, draft‑tokens=2; a more balanced choice at higher batch sizes.
max‑throughput: disables MTP; optimal for saturated workloads.

Additional specialized recipes: cp (prefill parallelism for long context) and pd‑disagg (prefill/decode disaggregation).

Gotchas

The DeepEP dispatch buffer must satisfy the following inequality; otherwise the buffer overflows:

max-running-requests × MTP_draft_tokens ≤ SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK
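A quick pre-flight check of this inequality can save a failed launch. The sketch below only evaluates the arithmetic; the function names and the sample values (a per-rank limit of 1024 tokens) are illustrative assumptions, not SGLang defaults.

```python
def deepep_dispatch_ok(max_running_requests, mtp_draft_tokens,
                       max_dispatch_tokens_per_rank):
    """True if the configured load fits in the DeepEP dispatch buffer."""
    return max_running_requests * mtp_draft_tokens <= max_dispatch_tokens_per_rank


def max_safe_running_requests(mtp_draft_tokens, max_dispatch_tokens_per_rank):
    """Largest max-running-requests value that still satisfies the inequality."""
    return max_dispatch_tokens_per_rank // mtp_draft_tokens


# Example: low-latency preset (draft-tokens=4) with a hypothetical limit of 1024.
print(deepep_dispatch_ok(256, 4, 1024))    # True
print(deepep_dispatch_ok(512, 4, 1024))    # False
print(max_safe_running_requests(4, 1024))  # 256
```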

H200 (Hopper) has two paths: the original FP4 checkpoint (Marlin w4a16 MoE kernel, TP‑only) and the SGLang‑converted FP8 checkpoint (sgl-project/DeepSeek-V4-Flash-FP8 / Pro-FP8).

PD‑Disagg on H200 requires --privileged --ulimit memlock=-1 or InfiniBand device access; otherwise large checkpoints may suffer KV transmission errors.

When using the base model, set SGLANG_FIX_DSV4_BASE_MODEL_LOAD=1.

For GB300 cross‑pod NVLink, add MC_FORCE_MNNVL=1 NCCL_MNNVL_ENABLE=1 NCCL_CUMEM_ENABLE=1 to both the prefill and the decode instances.

Conclusion

DeepSeek‑V4 pushes the open‑source LLM frontier by cutting per‑token inference FLOPs to 27 % and KV cache to 10 % of V3.2's at 1M context, making long context a default capability. The price is a near‑complete rewrite of KV, cache, and attention handling in inference engines. SGLang achieves Day‑0 readiness not with a few patches but through ShadowRadix, HiSparse, in‑graph speculative metadata, and a suite of new kernels.

Benchmarks from LMSYS show SGLang’s decode throughput remains flat from 4K to 900K context, with less than 10 % token‑rate drop on both B200 and H200, outperforming another open‑source engine in the same 30K‑context single‑batch test.

For readers interested in the engineering details, the main PR is sgl-project/sglang#23600, which registers V4Config, JIT kernel dtype mapping, FP8 weight post‑processing, Triton fallback for MLA on SM_120, and Marlin fallback for MXFP4 MoE, among other contributions.

