How PaddlePaddle 3.0 Boosts Large‑Model Inference with 4‑Bit Quantization and MLA Optimizations
PaddlePaddle 3.0 introduces a full‑stack inference engine that supports FP8, INT8, and 4‑bit quantization for popular LLMs such as DeepSeek V3/R1, delivers up to 2× token throughput on a single H800 GPU, and provides detailed deployment scripts for single‑node and multi‑node setups, including MTP speculative decoding and SageAttention for long‑sequence acceleration.
PaddlePaddle 3.0 dramatically upgrades large‑model inference by adding high‑performance support for multiple quantization precisions (FP8, INT8, 4‑bit) and a flexible intermediate representation (PIR) that optimizes model compression, compute, and deployment across various hardware.
Key Model Support and Performance
The framework now fully supports DeepSeek V3/R1 (full‑precision and distilled versions) with FP8 inference, achieving over 1,000 tokens/s on a single H800 GPU and up to 2,000 tokens/s with the 4‑bit quantized deployment, effectively doubling throughput compared with previous releases.
MLA Operator Optimizations
MLA kernels have been re‑engineered with multi‑stage pipelining, carefully tuned register and shared‑memory allocation, and warp‑group scheduling. A three‑warp‑group schedule processes KV tiles of length 64 within a 225 KB shared‑memory budget, while a four‑warp‑group variant overlaps the PV GEMM with Softmax on KV tiles of length 32 to ease register pressure (at most 128 registers per thread). On Hopper GPUs these optimizations deliver a 4%–23% speedup over FlashMLA.
MTP Speculative Decoding and Throughput
The new speculative decoding mechanism decouples the base model from the decoding method, supporting draft models, MTP/Eagle, and reference‑matching paradigms with minimal code changes. The optimizations keep the batch size constant while validating all draft tokens in a single forward pass, improving QPS by 144% and decoding speed by 42% without degrading latency.
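As a minimal, framework‑agnostic illustration of this single‑pass verification idea (the function below and the NumPy stand‑in for base‑model logits are hypothetical, not PaddlePaddle's actual API):

import numpy as np

def verify_draft_tokens(base_logits, draft_tokens):
    """Greedy single-pass verification: base_logits holds one row of
    base-model logits per draft position; accept the longest prefix where
    the base model's argmax agrees with the draft, then take the base
    model's own token at the first mismatch."""
    accepted = []
    for pos, token in enumerate(draft_tokens):
        best = int(np.argmax(base_logits[pos]))
        if best == token:
            accepted.append(token)   # draft token confirmed "for free"
        else:
            accepted.append(best)    # fall back to the base model's choice
            break                    # later draft tokens are discarded
    return accepted

# Hypothetical usage: a draft model proposed 4 tokens; the base model
# scored all 4 positions in one batched forward pass.
draft_tokens = [11, 42, 7, 99]
base_logits = np.random.randn(4, 32000)   # stand-in for real logits
print(verify_draft_tokens(base_logits, draft_tokens))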
Long‑Sequence Attention Quantization (SageAttention)
SageAttention dynamically quantizes Q/K to INT8 and V to FP8, reorganizing the attention stages: the INT32 Q·K results are dequantized to FP32 for softmax and then re‑quantized, while the final output is produced in FP8. This approach preserves near‑lossless accuracy and accelerates the Prefill stage by 37.4% on 64K‑token inputs.
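As a rough NumPy sketch of this dataflow (not the actual fused kernel; per‑tensor scales and float16 standing in for FP8 are simplifications, since NumPy has no native FP8 dtype):

import numpy as np

def quantize_int8(x):
    # Dynamic symmetric INT8 quantization with a per-tensor scale.
    scale = np.abs(x).max() / 127.0 + 1e-8
    return np.round(x / scale).astype(np.int8), scale

def sage_attention_sketch(q, k, v):
    # Q and K are dynamically quantized to INT8.
    q_i8, q_scale = quantize_int8(q)
    k_i8, k_scale = quantize_int8(k)
    # Q.K^T accumulates in INT32, then is dequantized to FP32 for softmax.
    scores_i32 = q_i8.astype(np.int32) @ k_i8.astype(np.int32).T
    scores = scores_i32.astype(np.float32) * (q_scale * k_scale)
    scores /= np.sqrt(q.shape[-1])
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    # V and the re-quantized probabilities stay in reduced precision
    # (float16 here as an FP8 stand-in).
    return probs.astype(np.float16) @ v.astype(np.float16)

q, k, v = (np.random.randn(8, 64) for _ in range(3))
out = sage_attention_sketch(q, k, v)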
Deployment Scripts
One‑click scripts are provided for rapid service launch. Example single‑node Docker command (H800, 4‑bit quantization):
export MODEL_PATH=${MODEL_PATH:-$PWD}
export model_name="deepseek-ai/DeepSeek-R1/weight_only_int4"
docker run --gpus all --shm-size 32G --network=host --privileged --cap-add=SYS_PTRACE \
-v $MODEL_PATH:/models \
-e "model_name=${model_name}" \
-e "MP_NUM=8" \
-e "CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7" \
-dit ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddlenlp:llm-serving-cuda124-cudnn9-v2.1 /bin/bash \
-c -ex 'start_server $model_name && tail -f /dev/null' && docker logs -f $(docker ps -lq)
For two‑node deployment, set POD_0_IP and POD_IPS, adjust MP_NUM and MP_NNODE, and run a similar Docker command with the appropriate environment variables.
Inference Requests
Example curl request:
curl 127.0.0.1:9965/v1/chat/completions -H 'Content-Type: application/json' \
-d '{"model":"default","text":"Hello, how are you?"}'Python OpenAI‑compatible client usage is also demonstrated.
Summary
PaddlePaddle 3.0 offers a complete stack—model compression tools, a high‑performance inference engine, and service‑oriented deployment—supporting a wide range of LLMs (DeepSeek, Qwen, Llama, Mixtral) and delivering precision‑preserving quantization (INT8, FP8, INT4) with substantial speedups on GPUs, ASICs, and CPUs.
Baidu Tech Salon
Baidu Tech Salon, organized by Baidu's Technology Management Department, is a monthly offline event that shares cutting‑edge tech trends from Baidu and the industry, providing a free platform for mid‑to‑senior engineers to exchange ideas.