INT8 Quantization and Inference Optimization of DeepSeek R1 Model
Meituan’s search and recommendation team converted the FP8‑only DeepSeek‑R1 model to INT8 by first casting the weights to BF16 and then applying block‑wise or channel‑wise quantization. The conversion preserves GSM8K and MMLU accuracy while delivering 33% to 50% higher throughput on A100‑80G GPUs. The team has publicly released the SGLang‑based inference scripts and quantized weights, enabling deployment on older NVIDIA hardware without accuracy loss.
DeepSeek R1 originally provides FP8 weights that can only run on the newest NVIDIA GPUs (Ada/Hopper). Meituan’s search and recommendation team applied INT8 quantization, enabling the model to run on older GPUs such as A100 while preserving accuracy and achieving about 50% higher throughput compared with BF16.
Quantization converts high‑precision weights and activations (e.g., BF16) to low‑precision INT8. A typical symmetric INT8 scheme computes a scale factor that maps the tensor’s value range onto the INT8 range, then inserts quantize/dequantize operations at the appropriate tensor locations in the compute graph.
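As an illustrative sketch (not Meituan’s released code), a minimal symmetric per‑tensor INT8 quantize/dequantize round trip in NumPy might look like this:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization: the scale maps the
    largest absolute value in the tensor onto the range [-127, 127]."""
    scale = float(np.abs(x).max()) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate high-precision tensor from INT8 values."""
    return q.astype(np.float32) * scale

# Round-trip error per element is bounded by half the scale (rounding),
# which is what makes the choice of scale granularity matter.
x = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(x)
x_hat = dequantize_int8(q, s)
```

In a real inference engine these quant/dequant steps are fused into the matmul kernels rather than run as separate passes, but the arithmetic is the same.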
The team explored two schemes: block‑wise quantization, which splits weight matrices into 128×128 blocks to limit quantization error, and channel‑wise quantization, which quantizes each output channel (column) of the weight matrix independently. Both require converting the native FP8 weights to BF16 first, then to INT8. Block‑wise needs multiple dequant steps during matrix multiplication, while channel‑wise has lower overhead but is more sensitive to outliers.
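To make the two granularities concrete, here is a hedged sketch of how the scale factors differ (assumed weight shapes and a 128×128 block size; the released kernels fuse this into the matmul):

```python
import numpy as np

def channelwise_scales(w: np.ndarray) -> np.ndarray:
    """One scale per output channel (column): low dequant overhead,
    but a single outlier in a column inflates that whole column's scale."""
    return np.abs(w).max(axis=0) / 127.0  # shape: (out_features,)

def blockwise_scales(w: np.ndarray, block: int = 128) -> np.ndarray:
    """One scale per 128x128 block: confines quantization error to each
    block, at the cost of extra dequant steps inside the matmul."""
    rows, cols = w.shape
    nr, nc = rows // block, cols // block
    # Row-major reshape so index [a, b, c, d] addresses w[a*block+b, c*block+d]
    blocks = w.reshape(nr, block, nc, block)
    return np.abs(blocks).max(axis=(1, 3)) / 127.0  # shape: (nr, nc)

w = np.random.randn(256, 256).astype(np.float32)
print(channelwise_scales(w).shape)  # (256,)
print(blockwise_scales(w).shape)    # (2, 2)
```

The trade‑off is visible in the shapes: block‑wise carries many more scales (and dequant steps) per weight matrix, which is exactly why it is more robust to outliers yet slower than channel‑wise.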
Evaluation on GSM8K and MMLU showed that both INT8 models retain accuracy comparable to the BF16 and FP8 baselines. Throughput tests on A100‑80G GPUs (32 GPUs in each configuration for a fair comparison) showed a 33% increase for block‑wise INT8 and up to 50% for channel‑wise INT8 over the BF16 baseline.
Deployment is performed with the open‑source SGLang inference framework. The following commands launch the servers on a two‑node cluster (8 × A100 per node).
# Block‑wise INT8 inference
# Master node
python3 -m sglang.launch_server \
--model meituan/DeepSeek-R1-Block-INT8 \
--tp 16 --dist-init-addr HEAD_IP:5000 \
--nnodes 2 --node-rank 0 --trust-remote-code \
--enable-torch-compile --torch-compile-max-bs 8
# Worker node
python3 -m sglang.launch_server \
--model meituan/DeepSeek-R1-Block-INT8 \
--tp 16 --dist-init-addr HEAD_IP:5000 \
--nnodes 2 --node-rank 1 --trust-remote-code \
--enable-torch-compile --torch-compile-max-bs 8
# Channel‑wise INT8 inference
# Master node
python3 -m sglang.launch_server \
--model meituan/DeepSeek-R1-Channel-INT8 \
--tp 16 --dist-init-addr HEAD_IP:5000 \
--nnodes 2 --node-rank 0 --trust-remote-code \
--enable-torch-compile --torch-compile-max-bs 8 \
--quantization w8a8_int8
# Worker node
python3 -m sglang.launch_server \
--model meituan/DeepSeek-R1-Channel-INT8 \
--tp 16 --dist-init-addr HEAD_IP:5000 \
--nnodes 2 --node-rank 1 --trust-remote-code \
--enable-torch-compile --torch-compile-max-bs 8 \
--quantization w8a8_int8

Example usage via a curl request demonstrates the model’s reasoning ability and correct answer generation.
curl -X POST 'http://HEAD_IP:5000/v1/chat/completions' \
--header 'Content-Type: application/json' \
-d '{
"model": "deepseek-r1",
"messages": [{"role": "user", "content": "Find the odd one out among the following: 1. Aluminum 2. Tin 3. Steel 4. Iron 5. Copper"}]
}'

An additional test generated a p5.js script for 100 bouncing balls inside a rotating sphere, showing that the INT8 model’s output is on par with the FP8 version.
write a script for 100 bouncing balls within a sphere, make sure to handle collision detection properly. make the sphere slowly rotate. make sure balls stays within the sphere. implement it in p5.js

In summary, INT8 quantization of DeepSeek R1, combined with SGLang, unlocks deployment on older GPUs without accuracy loss and with significant throughput gains. All code and quantized weights are publicly released on Hugging Face, inviting further community collaboration.
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.