Optimizing Large Model Inference: High‑Performance Frameworks and Techniques
The article reviews high‑performance inference strategies for large language models such as DeepSeek‑R1, detailing CPU‑GPU process separation, Paged and Radix Attention, Chunked Prefill, output‑length reduction, tensor‑parallel multi‑GPU scaling, and speculative decoding, each shown to markedly boost throughput and cut latency in real deployments.
Background: the rapid popularity of the DeepSeek‑R1 model highlights the growing need for efficient local deployment of large language models (LLMs). This article examines how to boost inference performance and shares practical deployment experience.
Key performance metrics: throughput (QPS, i.e., queries per second, and token/s) and latency (RT, overall response time, and TTFT, time to first token). Optimizing both is essential for production‑grade services.
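As a concrete reference for how these four metrics relate, here is a minimal sketch that derives them from per‑request timing traces; the `RequestTrace` structure and field names are illustrative, not from the article.

```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    start: float        # request arrival time (s)
    first_token: float  # time the first token was emitted (s)
    end: float          # time the last token was emitted (s)
    tokens_out: int     # number of generated tokens

def summarize(traces):
    """Compute throughput (QPS, token/s) and latency (RT, TTFT) from traces."""
    window = max(t.end for t in traces) - min(t.start for t in traces)
    qps = len(traces) / window
    tokens_per_s = sum(t.tokens_out for t in traces) / window
    avg_rt = sum(t.end - t.start for t in traces) / len(traces)
    avg_ttft = sum(t.first_token - t.start for t in traces) / len(traces)
    return {"qps": qps, "token/s": tokens_per_s, "rt": avg_rt, "ttft": avg_ttft}
```

Note that QPS and token/s are window‑level rates, while RT and TTFT are per‑request averages; an optimization can improve one pair while hurting the other.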
CPU‑GPU process separation: split the system into a CPU process (serialization, scheduling, resizing) and a GPU process (CUDA kernel execution). In internal tests this design raised GPU utilization from ~2% to 12% and improved QPS from 4.5 to 27.4.
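The separation can be sketched with two OS processes connected by queues; everything below (the fake tokenizer, the summed "forward pass") is a stand‑in to show the structure, not the article's implementation.

```python
import multiprocessing as mp

def cpu_worker(requests, to_gpu):
    # CPU process: serialization and scheduling only; it never touches
    # CUDA, so it cannot stall the GPU process with Python-side work.
    for req in requests:
        token_ids = [ord(c) % 256 for c in req]  # stand-in for tokenization
        to_gpu.put(token_ids)
    to_gpu.put(None)  # sentinel: no more work

def gpu_worker(to_gpu, results):
    # GPU process: in a real system this owns the CUDA context and runs
    # model.forward(); here a sum stands in for the kernel launch.
    while (batch := to_gpu.get()) is not None:
        results.put(sum(batch))
    results.put(None)

def run(requests):
    ctx = mp.get_context("fork")  # POSIX fork keeps the sketch self-contained
    to_gpu, results = ctx.Queue(), ctx.Queue()
    cpu = ctx.Process(target=cpu_worker, args=(requests, to_gpu))
    gpu = ctx.Process(target=gpu_worker, args=(to_gpu, results))
    cpu.start(); gpu.start()
    out = []
    while (r := results.get()) is not None:
        out.append(r)
    cpu.join(); gpu.join()
    return out
```

Because the two sides only share queues, the CPU process can already be serializing the next batch while the GPU process is still executing the current one.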
Paged Attention: inspired by OS virtual memory, the KV‑cache is managed as fixed‑size pages with a block‑table mapping. The approach eliminates GPU memory fragmentation and can increase throughput several‑fold compared with traditional single‑process designs.
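The block‑table idea can be illustrated with a toy allocator; `BLOCK_SIZE`, the free‑list, and the class below are a simplified sketch in the spirit of Paged Attention, not vLLM's actual data structures.

```python
BLOCK_SIZE = 16  # tokens per KV-cache page

class PagedKVCache:
    """Toy block-table allocator: physical pages are fixed-size, and each
    sequence holds a table mapping its logical pages to physical page ids,
    so free memory never fragments into unusable gaps."""

    def __init__(self, num_pages):
        self.free = list(range(num_pages))  # free-list of physical pages
        self.block_tables = {}              # seq_id -> [physical page ids]
        self.lengths = {}                   # seq_id -> tokens stored

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:             # current page full: grab a new one
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        # Finished requests return every page to the free-list, where any
        # other sequence can reuse it regardless of its logical position.
        self.free.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

The key property is that pages are allocated one at a time on demand, so no contiguous region ever has to be reserved for a sequence's worst‑case length.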
Radix Attention: builds a radix tree over shared prompt prefixes, allowing KV‑cache reuse across requests. Experiments show a 30% latency reduction and a 1.5× throughput gain over vanilla vLLM.
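A minimal sketch of the prefix‑matching idea follows. Note this uses an uncompressed one‑token‑per‑edge trie for clarity; SGLang's actual radix tree compresses runs of tokens into single edges.

```python
class RadixNode:
    def __init__(self):
        self.children = {}     # token id -> RadixNode
        self.kv_ready = False  # True if KV entries for this prefix are cached

class RadixCache:
    """Toy prefix tree over token sequences: match() reports how many
    leading tokens of a new request already have KV entries cached by
    earlier requests, so prefill can skip recomputing them."""

    def __init__(self):
        self.root = RadixNode()

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())
            node.kv_ready = True

    def match(self, tokens):
        node, hit = self.root, 0
        for t in tokens:
            child = node.children.get(t)
            if child is None or not child.kv_ready:
                break
            node, hit = child, hit + 1
        return hit  # prefix length whose KV cache can be reused
```

This is why the technique pays off most for workloads with a long shared system prompt or few‑shot examples: every request after the first skips that entire prefix during prefill.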
Chunked Prefill: long prompts are broken into fixed‑size chunks (e.g., 512 tokens), and the prefill stage processes one chunk per scheduler step alongside the concurrent decode stages, so a single long request can no longer block them. In the reported tests this reduces max RT dramatically, at the cost of average RT roughly doubling under high QPS.
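The scheduling logic can be sketched as a per‑step token budget shared between prefill and decode; the function and its data layout below are illustrative, not the article's scheduler.

```python
CHUNK = 512  # prefill token budget per scheduler step

def schedule_step(prefill_queue, decode_batch):
    """One scheduler iteration: spend at most CHUNK tokens of prefill work
    in the same step as all pending decode requests, so a long prompt can
    no longer monopolize the GPU for an entire iteration.

    prefill_queue: list of [remaining_prompt_tokens] per waiting request.
    Returns (prefill_tokens_this_step, decode_tokens_this_step)."""
    budget = CHUNK
    for req in prefill_queue:
        take = min(req[0], budget)
        req[0] -= take
        budget -= take
        if budget == 0:
            break
    # requests whose prefill finished leave the queue; in a real engine
    # they would join the decode batch on the next step
    prefill_queue[:] = [r for r in prefill_queue if r[0] > 0]
    return CHUNK - budget, len(decode_batch)  # one token per decode request
```

A 1200‑token prompt thus takes three steps to prefill (512 + 512 + 176), but decode requests make progress in every one of those steps instead of stalling for the full prompt.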
Output‑length reduction: limit max_tokens in the API call, add concise instructions in the prompt (e.g., “output only the result”), or fine‑tune the model to produce shorter answers. Shorter outputs directly lower latency.
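The first two levers can be combined in a single request. The sketch below builds an OpenAI‑style chat payload (a common API shape that SGLang also serves); the model name and the 256‑token cap are illustrative choices, not values from the article.

```python
def build_request(user_prompt, model="deepseek-r1"):
    """Assemble a chat request that bounds output length two ways:
    a hard max_tokens cap and a prompt-level brevity instruction."""
    return {
        "model": model,
        "max_tokens": 256,  # hard cap: generation stops after 256 tokens
        "messages": [
            {"role": "system",
             "content": "Output only the result, with no explanation."},
            {"role": "user", "content": user_prompt},
        ],
    }
```

Since decode cost is roughly linear in generated tokens, halving typical output length roughly halves the decode portion of RT without any engine changes.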
Multi‑GPU inference (tensor parallelism): distribute model weights across multiple GPUs and parallelize attention computation. Benchmarks show RT dropping from 3 s (single‑GPU) to 1.7 s (dual‑GPU) and QPS rising from 1.2 to 2.4.
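The core weight‑sharding idea can be shown with NumPy standing in for GPU tensors; the function below is a minimal column‑parallel sketch, with the final concatenation playing the role of the all‑gather a real multi‑GPU setup performs.

```python
import numpy as np

def tensor_parallel_matmul(x, w, num_gpus=2):
    """Column-wise tensor parallelism sketch: each 'GPU' holds a vertical
    slice of the weight matrix and computes a partial result; the slices
    are then concatenated (an all-gather across devices in practice)."""
    shards = np.array_split(w, num_gpus, axis=1)  # one weight shard per GPU
    partials = [x @ shard for shard in shards]    # would run in parallel
    return np.concatenate(partials, axis=1)       # gather partial outputs
```

Each device stores only `1/num_gpus` of the weights and does `1/num_gpus` of the FLOPs, which is why per‑request RT drops; the communication step is the overhead that keeps the dual‑GPU speed‑up (3 s to 1.7 s) below an ideal 2×.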
Speculative Decoding: run a smaller draft model first, then verify its tokens with the large model. For a 70B model, this cuts RT from 11 s to 2.8 s (≈3.9× speed‑up).
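A greedy variant of the propose‑then‑verify loop can be sketched as follows; `draft_next` and `target_next` are placeholder next‑token functions standing in for the two models, and real systems verify the whole proposal in one batched forward pass rather than token by token.

```python
def speculative_step(draft_next, target_next, context, k=4):
    """One round of greedy speculative decoding: the draft model proposes
    k tokens; the target model keeps the longest agreeing prefix plus one
    token of its own, so at least one token is produced per target pass."""
    proposal, ctx = [], list(context)
    for _ in range(k):
        t = draft_next(ctx)          # cheap draft-model prediction
        proposal.append(t)
        ctx.append(t)
    accepted, ctx = [], list(context)
    for t in proposal:
        if target_next(ctx) == t:    # target agrees: token accepted for free
            accepted.append(t)
            ctx.append(t)
        else:
            break                    # first disagreement ends the round
    accepted.append(target_next(ctx))  # target's own token (fix or bonus)
    return accepted
```

When the draft model agrees often, each expensive target pass yields up to k+1 tokens instead of one, which is where the large reported speed‑up comes from; when it disagrees, output is still exactly what the target model alone would have produced.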
Deployment of DeepSeek‑R1 with SGLang:
Hardware: 2 nodes × 8 H20 GPUs; Software: SGLang inference engine.
Launch commands:
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 0 --trust-remote-code

python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 1 --trust-remote-code

The article concludes with a concise recap of the presented acceleration methods, performance tables, and a list of references for further reading.
DeWu Technology