How PaddlePaddle 3.0 Boosts Large‑Model Inference with 4‑Bit Quantization and MLA Optimizations
PaddlePaddle 3.0 introduces a full‑stack inference engine that supports FP8, INT8, and 4‑bit quantization for popular LLMs such as DeepSeek V3/R1, delivers up to 2× token throughput on a single H800 GPU, and provides detailed deployment scripts for single‑node and multi‑node setups, including MTP speculative decoding and SageAttention for long‑sequence acceleration.
PaddlePaddle 3.0 dramatically upgrades large‑model inference by adding high‑performance support for multiple quantization precisions (FP8, INT8, 4‑bit) and a flexible intermediate representation (PIR) that optimizes model compression, compute, and deployment across various hardware.
Key Model Support and Performance
The framework now fully supports DeepSeek V3/R1 (full‑precision and distilled versions) with FP8 inference, achieving over 1,000 tokens/s on a single H800 GPU and up to 2,000 tokens/s with the 4‑bit quantized deployment, effectively doubling throughput compared with previous releases.
MLA Operator Optimizations
MLA kernels have been re‑engineered with multi‑stage pipelining, carefully tuned register and shared‑memory allocation, and warp‑group scheduling. A three‑warp‑group schedule processes KV tiles of length 64 within a 225 KB shared‑memory budget, while a four‑warp‑group variant overlaps the PV GEMM with Softmax on KV tiles of length 32 to ease register pressure (at most 128 registers per thread). On Hopper GPUs these optimizations deliver a 4%–23% speedup over FlashMLA.
MTP Speculative Decoding and Throughput
The new speculative decoding mechanism decouples the base model from the decoding method, supporting draft models, MTP/Eagle, and reference‑matching paradigms with minimal code changes. The optimizations keep the batch size constant while validating all draft tokens in a single forward pass, improving QPS by 144% and decoding speed by 42% without degrading latency.
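As a minimal, framework‑agnostic illustration of this single‑pass verification idea (the function below and the NumPy stand‑in for base‑model logits are hypothetical, not PaddlePaddle's actual API):

import numpy as np

def verify_draft_tokens(base_logits, draft_tokens):
    """Greedy single-pass verification: base_logits holds one row of
    base-model logits per draft position; accept the longest prefix where
    the base model's argmax agrees with the draft, then take the base
    model's own token at the first mismatch."""
    accepted = []
    for pos, token in enumerate(draft_tokens):
        best = int(np.argmax(base_logits[pos]))
        if best == token:
            accepted.append(token)   # draft token confirmed "for free"
        else:
            accepted.append(best)    # fall back to the base model's choice
            break                    # later draft tokens are discarded
    return accepted

# Hypothetical usage: a draft model proposed 4 tokens; the base model
# scored all 4 positions in one batched forward pass.
draft_tokens = [11, 42, 7, 99]
base_logits = np.random.randn(4, 32000)   # stand-in for real logits
print(verify_draft_tokens(base_logits, draft_tokens))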
Long‑Sequence Attention Quantization (SageAttention)
SageAttention dynamically quantizes Q/K to INT8 and V to FP8, reorganizing the attention stages: the INT32 Q·K results are dequantized to FP32 for softmax and then re‑quantized, while the final output is produced in FP8. This approach preserves near‑lossless accuracy and accelerates the Prefill stage by 37.4% on 64K‑token inputs.
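As a rough NumPy sketch of this dataflow (not the actual fused kernel; per‑tensor scales and float16 standing in for FP8 are simplifications, since NumPy has no native FP8 dtype):

import numpy as np

def quantize_int8(x):
    # Dynamic symmetric INT8 quantization with a per-tensor scale.
    scale = np.abs(x).max() / 127.0 + 1e-8
    return np.round(x / scale).astype(np.int8), scale

def sage_attention_sketch(q, k, v):
    # Q and K are dynamically quantized to INT8.
    q_i8, q_scale = quantize_int8(q)
    k_i8, k_scale = quantize_int8(k)
    # Q.K^T accumulates in INT32, then is dequantized to FP32 for softmax.
    scores_i32 = q_i8.astype(np.int32) @ k_i8.astype(np.int32).T
    scores = scores_i32.astype(np.float32) * (q_scale * k_scale)
    scores /= np.sqrt(q.shape[-1])
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    # V and the re-quantized probabilities stay in reduced precision
    # (float16 here as an FP8 stand-in).
    return probs.astype(np.float16) @ v.astype(np.float16)

q, k, v = (np.random.randn(8, 64) for _ in range(3))
out = sage_attention_sketch(q, k, v)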
Deployment Scripts
One‑click scripts are provided for rapid service launch. Example single‑node Docker command (H800, 4‑bit quantization):
export MODEL_PATH=${MODEL_PATH:-$PWD}
export model_name="deepseek-ai/DeepSeek-R1/weight_only_int4"
docker run --gpus all --shm-size 32G --network=host --privileged --cap-add=SYS_PTRACE \
-v $MODEL_PATH:/models \
-e "model_name=${model_name}" \
-e "MP_NUM=8" \
-e "CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7" \
-dit ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddlenlp:llm-serving-cuda124-cudnn9-v2.1 /bin/bash \
-c -ex 'start_server $model_name && tail -f /dev/null' && docker logs -f $(docker ps -lq)
For two‑node deployment, set POD_0_IP and POD_IPS, adjust MP_NUM and MP_NNODE, and run a similar Docker command with the appropriate environment variables.
Inference Requests
Example curl request:
curl 127.0.0.1:9965/v1/chat/completions -H 'Content-Type: application/json' \
-d '{"model":"default","text":"Hello, how are you?"}'Python OpenAI‑compatible client usage is also demonstrated.
Summary
PaddlePaddle 3.0 offers a complete stack—model compression tools, a high‑performance inference engine, and service‑oriented deployment—supporting a wide range of LLMs (DeepSeek, Qwen, Llama, Mixtral) and delivering precision‑preserving quantization (INT8, FP8, INT4) with substantial speedups on GPUs, ASICs, and CPUs.
Baidu Tech Salon
Baidu Tech Salon, organized by Baidu's Technology Management Department, is a monthly offline event that shares cutting‑edge tech trends from Baidu and the industry, providing a free platform for mid‑to‑senior engineers to exchange ideas.