vLLM Quantized Inference: Loading AWQ/GPTQ Models and Optimizing GPU Memory
This article provides a step‑by‑step guide on using vLLM to load AWQ and GPTQ quantized large language models, covering environment setup, calibration data preparation, model quantization, deployment scripts, performance benchmarking, accuracy checks, best‑practice recommendations, and troubleshooting tips for GPU memory optimization.
The article explains how to reduce the GPU memory footprint of large language models (LLMs) by applying AWQ and GPTQ quantization techniques using the vLLM inference engine. It starts with an overview of the memory challenges for 70B‑parameter models and presents empirical results showing a 50‑75% reduction in VRAM usage with only 3‑5% accuracy loss.
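As a rough check on those figures, weight memory scales with parameter count times bits per weight. The sketch below assumes the weights dominate the footprint and ignores the KV cache, activations, and the per-group scales and zero points that quantized formats store, so measured numbers will be somewhat higher. The function names are introduced here for illustration only.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def reduction_vs_fp16(bits_per_weight: float) -> float:
    """Fraction of FP16 weight memory saved at the given bit-width."""
    return 1.0 - bits_per_weight / 16.0


if __name__ == "__main__":
    params_70b = 70e9  # assumption: a nominal 70B-parameter model
    for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
        size = weight_memory_gb(params_70b, bits)
        saved = reduction_vs_fp16(bits) * 100
        print(f"{label}: {size:.0f} GB of weights, {saved:.0f}% saved vs FP16")
```

Under these assumptions a 70B model needs about 140 GB of weights in FP16, 70 GB at 8 bits, and 35 GB at 4 bits, which is where the 50‑75% range comes from.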
Key steps include:
Preparing the system (Ubuntu 20.04+, CUDA 11.8+, Python 3.9‑3.11) and installing the required packages: torch, vllm, autoawq (imported as awq), and auto-gptq.
Downloading original model checkpoints (e.g., LLaMA2‑7B‑Chat, Mistral‑7B‑Instruct) via huggingface-cli.
Generating a small calibration dataset (128 samples from WikiText‑2) and saving it as /tmp/awq_calibration.json.
Running AWQ 4‑bit quantization with a configuration {"zero_point": true, "q_group_size": 128, "w_bit": 4} and saving the quantized model.
Running GPTQ quantization using BaseQuantizeConfig (bits=4, group_size=128, damp_percent=0.01) and saving the result.
Launching the quantized model as an API service with python -m vllm.entrypoints.api_server, specifying --quantization awq or --quantization gptq, and tuning --gpu-memory-utilization (the fraction of GPU memory vLLM may claim) and --swap-space (CPU swap space per GPU, in GiB).
Benchmarking inference speed, memory usage, and token throughput with a custom benchmark_quantized.py script, and comparing results against the FP16 baseline.
Performing a qualitative accuracy test on the TruthfulQA benchmark and reporting mean quantization error.
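The calibration step in the list above could be sketched as follows. The summary does not show the file layout, so a plain JSON list of strings is an assumption here, as are the `min_chars` filter and the helper names. `load_wikitext2_texts` needs the `datasets` package and network access, and the dataset identifier may differ between `datasets` versions.

```python
import json
from pathlib import Path


def build_calibration_set(texts, num_samples=128, min_chars=200):
    """Keep the first `num_samples` passages that have at least `min_chars` characters."""
    samples = []
    for text in texts:
        if len(samples) >= num_samples:
            break
        text = text.strip()
        if len(text) >= min_chars:  # skip headings and blank lines
            samples.append(text)
    return samples


def save_calibration_set(samples, path="/tmp/awq_calibration.json"):
    """Write the samples as a JSON list of strings (assumed layout)."""
    Path(path).write_text(
        json.dumps(samples, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return path


def load_wikitext2_texts():
    """Return the raw WikiText-2 training lines; imported lazily on purpose."""
    from datasets import load_dataset

    dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    return dataset["text"]


if __name__ == "__main__":
    samples = build_calibration_set(load_wikitext2_texts())
    print(f"saved {len(samples)} samples to {save_calibration_set(samples)}")
```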
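For the two quantization steps, a minimal sketch using the AutoAWQ and AutoGPTQ Python APIs might look like this. The configuration values are the ones quoted in the list; the function names, paths, and the way calibration texts are passed in are assumptions, and the article's actual scripts are not shown in this summary. Library imports are kept inside the functions so the configuration can be inspected without a GPU.

```python
# Values quoted in the article summary.
AWQ_QUANT_CONFIG = {"zero_point": True, "q_group_size": 128, "w_bit": 4}
GPTQ_QUANT_KWARGS = {"bits": 4, "group_size": 128, "damp_percent": 0.01}


def quantize_with_awq(model_path, output_path, calib_texts=None):
    """Quantize with AutoAWQ; `calib_texts` is an optional list of strings."""
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoAWQForCausalLM.from_pretrained(model_path)
    kwargs = {"quant_config": AWQ_QUANT_CONFIG}
    if calib_texts is not None:
        kwargs["calib_data"] = calib_texts  # otherwise AutoAWQ uses its default set
    model.quantize(tokenizer, **kwargs)
    model.save_quantized(output_path)
    tokenizer.save_pretrained(output_path)


def quantize_with_gptq(model_path, output_path, calib_texts):
    """Quantize with AutoGPTQ using tokenized calibration examples."""
    from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    quantize_config = BaseQuantizeConfig(**GPTQ_QUANT_KWARGS)
    model = AutoGPTQForCausalLM.from_pretrained(model_path, quantize_config)
    # Each example is a dict holding input_ids and attention_mask tensors.
    examples = [tokenizer(text, return_tensors="pt") for text in calib_texts]
    model.quantize(examples)
    model.save_quantized(output_path)
    tokenizer.save_pretrained(output_path)
```

Both functions need a CUDA-capable GPU and the original checkpoint on disk, so treat them as a starting point to adapt rather than a verified pipeline.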
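The launch step can be expressed as a small command builder, which makes the memory flags easy to vary per GPU. The flags are the ones named in the list plus `--model` and `--port`; the default values and the example model path are placeholders chosen for illustration, not the article's settings.

```python
import shlex
import sys


def vllm_server_command(model_path, quantization,
                        gpu_memory_utilization=0.90, swap_space_gib=4, port=8000):
    """Build the argv for vLLM's simple API server with a quantized model."""
    if quantization not in {"awq", "gptq"}:
        raise ValueError(f"unsupported quantization: {quantization!r}")
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return [
        sys.executable, "-m", "vllm.entrypoints.api_server",
        "--model", model_path,
        "--quantization", quantization,
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--swap-space", str(swap_space_gib),
        "--port", str(port),
    ]


if __name__ == "__main__":
    # Hypothetical local path to a quantized checkpoint.
    cmd = vllm_server_command("/models/llama2-7b-chat-awq", "awq")
    print(shlex.join(cmd))
    # To start the server: import subprocess; subprocess.run(cmd, check=True)
```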
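The benchmark_quantized.py script itself is not reproduced in this summary. As a stand-in, the sketch below shows one way to measure latency and token throughput for any backend: `generate` is a hypothetical callable that runs one prompt and returns the number of tokens it produced, so the same harness can wrap an FP16 baseline and a quantized model.

```python
import statistics
import time


def benchmark(generate, prompts, runs=3):
    """Time `generate(prompt)` over all prompts, `runs` times, and summarize."""
    latencies = []
    total_tokens = 0
    for _ in range(runs):
        for prompt in prompts:
            start = time.perf_counter()
            total_tokens += generate(prompt)  # generate returns a token count
            latencies.append(time.perf_counter() - start)
    total_time = sum(latencies)
    return {
        "requests": len(latencies),
        "mean_latency_s": statistics.mean(latencies),
        "tokens_per_s": total_tokens / total_time if total_time > 0 else float("inf"),
    }


if __name__ == "__main__":
    def fake_generate(prompt):
        time.sleep(0.01)  # stands in for a real model call
        return 32

    print(benchmark(fake_generate, ["hello", "world"]))
```

GPU memory is not captured here; tools such as nvidia-smi or torch.cuda.max_memory_allocated() are the usual way to record it alongside these timings.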
The performance table shows that AWQ 4‑bit reduces memory from 13.45 GB to 4.12 GB on an RTX 4090, cuts latency by ~25%, and maintains 95% of the original MMLU score, while GPTQ 4‑bit offers similar memory savings with a 30% speed boost. Recommendations advise using 8‑bit quantization when accuracy is critical, 4‑bit for edge devices, and GPTQ when fast quantization or EXL2 format is needed.
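To relate the measured figures to the headline range, the reduction can be computed directly from the two memory values quoted above. This is plain arithmetic on the reported numbers, with a helper name chosen here for illustration.

```python
def percent_reduction(baseline: float, quantized: float) -> float:
    """Percentage saved when going from `baseline` to `quantized`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - quantized) / baseline * 100.0


if __name__ == "__main__":
    saved = percent_reduction(13.45, 4.12)  # GB values reported for the RTX 4090
    print(f"AWQ 4-bit memory reduction: {saved:.1f}%")
```

The result is roughly 69%, which sits inside the 50‑75% range cited earlier.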
Best‑practice sections cover selecting bit‑width based on GPU capacity, using domain‑specific calibration data, enabling prefix caching, employing EXL2 for further speed gains, and deploying multi‑model services with fallback mechanisms. The article also lists common pitfalls, detailed troubleshooting commands, monitoring metrics, backup and restore procedures, and provides configuration templates for various hardware scenarios.
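One way to turn the "select bit‑width based on GPU capacity" advice into a rule of thumb is to pick the widest format whose weights fit in VRAM after reserving headroom for the KV cache and activations. The 30% headroom and the candidate list below are assumptions for illustration, not values from the article, and the estimate ignores quantization metadata.

```python
def pick_bit_width(num_params, vram_gb, headroom_fraction=0.30, candidates=(16, 8, 4)):
    """Return the widest bit-width whose weights fit in the VRAM budget, or None."""
    budget_gb = vram_gb * (1.0 - headroom_fraction)
    for bits in candidates:  # ordered widest first
        weights_gb = num_params * bits / 8 / 1e9
        if weights_gb <= budget_gb:
            return bits
    return None  # nothing fits: consider a smaller model or multiple GPUs


if __name__ == "__main__":
    for vram in (8, 12, 24):
        print(f"7B model on {vram} GB GPU -> {pick_bit_width(7e9, vram)}-bit")
```

A stricter accuracy requirement would argue for stopping at 8 bits even when 4 bits would also fit, in line with the recommendations above.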
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Raymond Ops
