Optimizing Large Model Inference: High‑Performance Frameworks and Techniques
The article reviews high‑performance inference strategies for large language models such as DeepSeek‑R1, detailing CPU‑GPU process separation, Paged and Radix Attention, Chunked Prefill, output‑length reduction, tensor‑parallel multi‑GPU scaling, and speculative decoding, each shown to markedly boost throughput and cut latency in real deployments.
Background: the rapid popularity of the DeepSeek‑R1 model highlights the growing need for efficient local deployment of large language models (LLMs). This article examines how to boost inference performance and shares practical deployment experience.
Key performance metrics: throughput (QPS, i.e., queries per second, and token/s) and latency (RT, overall response time, and TTFT, time to first token). Optimizing both is essential for production‑grade services.
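As a concrete reference for how these four metrics relate, here is a minimal sketch that derives them from per‑request timing traces; the `RequestTrace` structure and field names are illustrative, not from the article.

```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    start: float        # request arrival time (s)
    first_token: float  # time the first token was emitted (s)
    end: float          # time the last token was emitted (s)
    tokens_out: int     # number of generated tokens

def summarize(traces):
    """Compute throughput (QPS, token/s) and latency (RT, TTFT) from traces."""
    window = max(t.end for t in traces) - min(t.start for t in traces)
    qps = len(traces) / window
    tokens_per_s = sum(t.tokens_out for t in traces) / window
    avg_rt = sum(t.end - t.start for t in traces) / len(traces)
    avg_ttft = sum(t.first_token - t.start for t in traces) / len(traces)
    return {"qps": qps, "token/s": tokens_per_s, "rt": avg_rt, "ttft": avg_ttft}
```

Note that QPS and token/s are window‑level rates, while RT and TTFT are per‑request averages; an optimization can improve one pair while hurting the other.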
CPU‑GPU process separation: split the system into a CPU process (serialization, scheduling, resizing) and a GPU process (CUDA kernel execution). In internal tests this design raised GPU utilization from ~2% to 12% and improved QPS from 4.5 to 27.4.
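The separation can be sketched with two OS processes connected by queues; everything below (the fake tokenizer, the summed "forward pass") is a stand‑in to show the structure, not the article's implementation.

```python
import multiprocessing as mp

def cpu_worker(requests, to_gpu):
    # CPU process: serialization and scheduling only; it never touches
    # CUDA, so it cannot stall the GPU process with Python-side work.
    for req in requests:
        token_ids = [ord(c) % 256 for c in req]  # stand-in for tokenization
        to_gpu.put(token_ids)
    to_gpu.put(None)  # sentinel: no more work

def gpu_worker(to_gpu, results):
    # GPU process: in a real system this owns the CUDA context and runs
    # model.forward(); here a sum stands in for the kernel launch.
    while (batch := to_gpu.get()) is not None:
        results.put(sum(batch))
    results.put(None)

def run(requests):
    ctx = mp.get_context("fork")  # POSIX fork keeps the sketch self-contained
    to_gpu, results = ctx.Queue(), ctx.Queue()
    cpu = ctx.Process(target=cpu_worker, args=(requests, to_gpu))
    gpu = ctx.Process(target=gpu_worker, args=(to_gpu, results))
    cpu.start(); gpu.start()
    out = []
    while (r := results.get()) is not None:
        out.append(r)
    cpu.join(); gpu.join()
    return out
```

Because the two sides only share queues, the CPU process can already be serializing the next batch while the GPU process is still executing the current one.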
Paged Attention: inspired by OS virtual memory, the KV‑cache is managed as fixed‑size pages with a block‑table mapping. The approach eliminates GPU memory fragmentation and can increase throughput several‑fold compared with traditional single‑process designs.
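The block‑table idea can be illustrated with a toy allocator; `BLOCK_SIZE`, the free‑list, and the class below are a simplified sketch in the spirit of Paged Attention, not vLLM's actual data structures.

```python
BLOCK_SIZE = 16  # tokens per KV-cache page

class PagedKVCache:
    """Toy block-table allocator: physical pages are fixed-size, and each
    sequence holds a table mapping its logical pages to physical page ids,
    so free memory never fragments into unusable gaps."""

    def __init__(self, num_pages):
        self.free = list(range(num_pages))  # free-list of physical pages
        self.block_tables = {}              # seq_id -> [physical page ids]
        self.lengths = {}                   # seq_id -> tokens stored

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:             # current page full: grab a new one
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        # Finished requests return every page to the free-list, where any
        # other sequence can reuse it regardless of its logical position.
        self.free.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

The key property is that pages are allocated one at a time on demand, so no contiguous region ever has to be reserved for a sequence's worst‑case length.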
Radix Attention: builds a radix tree over shared prompt prefixes, allowing KV‑cache reuse across requests. Experiments show a 30% latency reduction and a 1.5× throughput gain over vanilla vLLM.
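A minimal sketch of the prefix‑matching idea follows. Note this uses an uncompressed one‑token‑per‑edge trie for clarity; SGLang's actual radix tree compresses runs of tokens into single edges.

```python
class RadixNode:
    def __init__(self):
        self.children = {}     # token id -> RadixNode
        self.kv_ready = False  # True if KV entries for this prefix are cached

class RadixCache:
    """Toy prefix tree over token sequences: match() reports how many
    leading tokens of a new request already have KV entries cached by
    earlier requests, so prefill can skip recomputing them."""

    def __init__(self):
        self.root = RadixNode()

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())
            node.kv_ready = True

    def match(self, tokens):
        node, hit = self.root, 0
        for t in tokens:
            child = node.children.get(t)
            if child is None or not child.kv_ready:
                break
            node, hit = child, hit + 1
        return hit  # prefix length whose KV cache can be reused
```

This is why the technique pays off most for workloads with a long shared system prompt or few‑shot examples: every request after the first skips that entire prefix during prefill.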
Chunked Prefill: long prompts are broken into fixed‑size chunks (e.g., 512 tokens), and the prefill stage processes one chunk per scheduler step alongside the concurrent decode stages, so a single long request can no longer block them. In the reported tests this reduces max RT dramatically, at the cost of average RT roughly doubling under high QPS.
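The scheduling logic can be sketched as a per‑step token budget shared between prefill and decode; the function and its data layout below are illustrative, not the article's scheduler.

```python
CHUNK = 512  # prefill token budget per scheduler step

def schedule_step(prefill_queue, decode_batch):
    """One scheduler iteration: spend at most CHUNK tokens of prefill work
    in the same step as all pending decode requests, so a long prompt can
    no longer monopolize the GPU for an entire iteration.

    prefill_queue: list of [remaining_prompt_tokens] per waiting request.
    Returns (prefill_tokens_this_step, decode_tokens_this_step)."""
    budget = CHUNK
    for req in prefill_queue:
        take = min(req[0], budget)
        req[0] -= take
        budget -= take
        if budget == 0:
            break
    # requests whose prefill finished leave the queue; in a real engine
    # they would join the decode batch on the next step
    prefill_queue[:] = [r for r in prefill_queue if r[0] > 0]
    return CHUNK - budget, len(decode_batch)  # one token per decode request
```

A 1200‑token prompt thus takes three steps to prefill (512 + 512 + 176), but decode requests make progress in every one of those steps instead of stalling for the full prompt.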
Output‑length reduction: limit max_tokens in the API call, add concise instructions in the prompt (e.g., “output only the result”), or fine‑tune the model to produce shorter answers. Shorter outputs directly lower latency.
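The first two levers can be combined in a single request. The sketch below builds an OpenAI‑style chat payload (a common API shape that SGLang also serves); the model name and the 256‑token cap are illustrative choices, not values from the article.

```python
def build_request(user_prompt, model="deepseek-r1"):
    """Assemble a chat request that bounds output length two ways:
    a hard max_tokens cap and a prompt-level brevity instruction."""
    return {
        "model": model,
        "max_tokens": 256,  # hard cap: generation stops after 256 tokens
        "messages": [
            {"role": "system",
             "content": "Output only the result, with no explanation."},
            {"role": "user", "content": user_prompt},
        ],
    }
```

Since decode cost is roughly linear in generated tokens, halving typical output length roughly halves the decode portion of RT without any engine changes.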
Multi‑GPU inference (tensor parallelism): distribute model weights across multiple GPUs and parallelize attention computation. Benchmarks show RT dropping from 3 s (single‑GPU) to 1.7 s (dual‑GPU) and QPS rising from 1.2 to 2.4.
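The core weight‑sharding idea can be shown with NumPy standing in for GPU tensors; the function below is a minimal column‑parallel sketch, with the final concatenation playing the role of the all‑gather a real multi‑GPU setup performs.

```python
import numpy as np

def tensor_parallel_matmul(x, w, num_gpus=2):
    """Column-wise tensor parallelism sketch: each 'GPU' holds a vertical
    slice of the weight matrix and computes a partial result; the slices
    are then concatenated (an all-gather across devices in practice)."""
    shards = np.array_split(w, num_gpus, axis=1)  # one weight shard per GPU
    partials = [x @ shard for shard in shards]    # would run in parallel
    return np.concatenate(partials, axis=1)       # gather partial outputs
```

Each device stores only `1/num_gpus` of the weights and does `1/num_gpus` of the FLOPs, which is why per‑request RT drops; the communication step is the overhead that keeps the dual‑GPU speed‑up (3 s to 1.7 s) below an ideal 2×.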
Speculative Decoding: run a smaller draft model first, then verify its tokens with the large model. For a 70B model, this cuts RT from 11 s to 2.8 s (≈3.9× speed‑up).
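A greedy variant of the propose‑then‑verify loop can be sketched as follows; `draft_next` and `target_next` are placeholder next‑token functions standing in for the two models, and real systems verify the whole proposal in one batched forward pass rather than token by token.

```python
def speculative_step(draft_next, target_next, context, k=4):
    """One round of greedy speculative decoding: the draft model proposes
    k tokens; the target model keeps the longest agreeing prefix plus one
    token of its own, so at least one token is produced per target pass."""
    proposal, ctx = [], list(context)
    for _ in range(k):
        t = draft_next(ctx)          # cheap draft-model prediction
        proposal.append(t)
        ctx.append(t)
    accepted, ctx = [], list(context)
    for t in proposal:
        if target_next(ctx) == t:    # target agrees: token accepted for free
            accepted.append(t)
            ctx.append(t)
        else:
            break                    # first disagreement ends the round
    accepted.append(target_next(ctx))  # target's own token (fix or bonus)
    return accepted
```

When the draft model agrees often, each expensive target pass yields up to k+1 tokens instead of one, which is where the large reported speed‑up comes from; when it disagrees, output is still exactly what the target model alone would have produced.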
Deployment of DeepSeek‑R1 with SGLang:
Hardware: 2 nodes × 8 H20 GPUs; Software: SGLang inference engine.
Launch commands:
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 0 --trust-remote-code

python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 1 --trust-remote-code

The article concludes with a concise recap of the presented acceleration methods, performance tables, and a list of references for further reading.
DeWu Technology