vLLM Quantized Inference: Loading AWQ/GPTQ Models and Optimizing GPU Memory

This article provides a step‑by‑step guide on using vLLM to load AWQ and GPTQ quantized large language models, covering environment setup, calibration data preparation, model quantization, deployment scripts, performance benchmarking, accuracy checks, best‑practice recommendations, and troubleshooting tips for GPU memory optimization.

Raymond Ops

The article explains how to reduce the GPU memory footprint of large language models (LLMs) by applying AWQ and GPTQ quantization techniques using the vLLM inference engine. It starts with an overview of the memory challenges for 70B‑parameter models and presents empirical results showing a 50‑75% reduction in VRAM usage with only 3‑5% accuracy loss.
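To make the memory challenge concrete, here is a rough back-of-the-envelope sketch of weight storage at different bit-widths. It is not taken from the article: the function name, the decimal-GB convention, and the per-group overhead model (one 16-bit scale plus one zero-point per group) are assumptions for illustration, and the estimate ignores the KV cache and activations entirely.

```python
def weight_memory_gb(n_params, bits, group_size=None, scale_bits=16):
    """Estimate weight storage in decimal GB (1 GB = 1e9 bytes).

    If group_size is given, add an assumed per-group overhead of one
    scale (scale_bits wide) plus one zero-point (bits wide). Real
    kernels may pack this metadata differently.
    """
    bits_per_weight = float(bits)
    if group_size is not None:
        bits_per_weight += (scale_bits + bits) / group_size
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    n = 70e9  # a 70B-parameter model
    fp16 = weight_memory_gb(n, 16)
    rows = [
        ("FP16", fp16),
        ("8-bit", weight_memory_gb(n, 8)),
        ("4-bit", weight_memory_gb(n, 4)),
        ("4-bit, group 128", weight_memory_gb(n, 4, group_size=128)),
    ]
    for label, gb in rows:
        saving = 1 - gb / fp16
        print(f"{label:18s} {gb:7.1f} GB  ({saving:.0%} smaller than FP16)")
```

Under these assumptions, 8-bit weights take half the FP16 footprint and 4-bit weights a quarter, which is consistent with the 50‑75% range quoted above; measured VRAM savings are usually somewhat smaller because the KV cache and runtime buffers are not quantized along with the weights.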

Key steps include:

Preparing the system (Ubuntu 20.04+, CUDA 11.8+, Python 3.9‑3.11) and installing required packages such as torch, vllm, autoawq (imported as awq), and auto-gptq.

Downloading original model checkpoints (e.g., LLaMA2‑7B‑Chat, Mistral‑7B‑Instruct) via huggingface-cli.

Generating a small calibration dataset (128 samples from WikiText‑2) and saving it as /tmp/awq_calibration.json.

Running AWQ 4‑bit quantization with a configuration {"zero_point": true, "q_group_size": 128, "w_bit": 4} and saving the quantized model.

Running GPTQ quantization using BaseQuantizeConfig (bits=4, group_size=128, damp_percent=0.01) and saving the result.

Launching the quantized model as an API service with python -m vllm.entrypoints.api_server, specifying --quantization awq or --quantization gptq, and tuning --gpu-memory-utilization (the fraction of GPU memory vLLM may claim) and --swap-space (CPU memory, in GiB per GPU, used for swapping KV‑cache blocks off the GPU).

Benchmarking inference speed, memory usage, and token throughput with a custom benchmark_quantized.py script, and comparing results against the FP16 baseline.

Performing a qualitative accuracy test on the TruthfulQA benchmark and reporting mean quantization error.
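The calibration, AWQ quantization, and launch steps listed above could be sketched roughly as follows. This is a minimal illustration, not the article's own scripts: the helper names are hypothetical, the default values for --gpu-memory-utilization and --swap-space are placeholders, and the quantization function assumes the AutoAWQ package (autoawq) and transformers are installed on a machine with a suitable GPU. The heavy imports are deferred so the rest of the sketch runs without them.

```python
import itertools
import json

# Quantization settings as given in the article (Python spelling of the JSON).
AWQ_QUANT_CONFIG = {"zero_point": True, "q_group_size": 128, "w_bit": 4}


def write_calibration_file(texts, path="/tmp/awq_calibration.json", n_samples=128):
    """Keep the first n_samples non-empty texts and save them as a JSON list.

    `texts` can be any iterable of strings, e.g. the text column of
    WikiText-2 loaded by whatever means is available.
    """
    non_empty = (t.strip() for t in texts if t and t.strip())
    samples = list(itertools.islice(non_empty, n_samples))
    with open(path, "w", encoding="utf-8") as f:
        json.dump(samples, f, ensure_ascii=False)
    return samples


def quantize_awq(model_path, output_path, calib_texts):
    """Quantize a checkpoint with AutoAWQ and save it (requires GPU + autoawq)."""
    # Imported lazily: these are heavy, optional dependencies.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model.quantize(tokenizer, quant_config=AWQ_QUANT_CONFIG, calib_data=calib_texts)
    model.save_quantized(output_path)
    tokenizer.save_pretrained(output_path)


def build_server_command(model_path, quantization,
                         gpu_memory_utilization=0.90, swap_space_gib=4):
    """Build the argv list for launching the vLLM API server."""
    if quantization not in ("awq", "gptq"):
        raise ValueError(f"unsupported quantization for this sketch: {quantization!r}")
    return [
        "python", "-m", "vllm.entrypoints.api_server",
        "--model", model_path,
        "--quantization", quantization,
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--swap-space", str(swap_space_gib),
    ]
```

The returned list can be passed to subprocess.run(cmd, check=True). Building the command as a list rather than a shell string avoids quoting problems when model paths contain spaces.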

The performance table shows that AWQ 4‑bit reduces memory from 13.45 GB to 4.12 GB on an RTX 4090 (about a 69% reduction), cuts latency by about 25%, and retains 95% of the original MMLU score, while GPTQ 4‑bit offers similar memory savings with a 30% speed boost. The recommendations are to use 8‑bit quantization when accuracy is critical, 4‑bit for edge devices, and GPTQ when fast quantization or the EXL2 format is needed.
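The reported figures mix two kinds of metric: a latency cut for AWQ and a speed boost for GPTQ. The short sketch below converts between them so the two can be compared on the same scale. It assumes that "speed boost" means a throughput multiplier and that latency and throughput are simply inversely related, which the article does not state explicitly; the function names are illustrative.

```python
def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


def speedup_from_latency_cut(cut_fraction):
    """Throughput multiplier implied by cutting latency by cut_fraction."""
    return 1.0 / (1.0 - cut_fraction)


def latency_cut_from_speedup(speedup):
    """Latency reduction (as a fraction) implied by a throughput multiplier."""
    return 1.0 - 1.0 / speedup


if __name__ == "__main__":
    fp16_gb, awq4_gb = 13.45, 4.12  # figures reported in the article
    print(f"AWQ memory reduction: {percent_reduction(fp16_gb, awq4_gb):.1f}%")
    print(f"AWQ 25% latency cut  -> {speedup_from_latency_cut(0.25):.2f}x throughput")
    print(f"GPTQ 1.30x speed-up  -> {latency_cut_from_speedup(1.30):.1%} latency cut")
```

Under that reading, a 25% latency cut corresponds to roughly a 1.33x speed-up, and a 30% speed boost to roughly a 23% latency cut, so the two methods land close together.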

Best‑practice sections cover selecting bit‑width based on GPU capacity, using domain‑specific calibration data, enabling prefix caching, employing EXL2 for further speed gains, and deploying multi‑model services with fallback mechanisms. The article also lists common pitfalls, detailed troubleshooting commands, monitoring metrics, backup and restore procedures, and provides configuration templates for various hardware scenarios.
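Selecting a bit-width from GPU capacity can be approximated with a simple rule of thumb. The sketch below is a hypothetical heuristic, not the article's procedure: it picks the widest candidate format whose weights fit inside an assumed share of GPU memory, leaving the remainder for the KV cache and runtime buffers. The 70% weight budget is an arbitrary placeholder that should be tuned per workload.

```python
def weight_gb(n_params, bits):
    """Weight storage in decimal GB, ignoring quantization metadata."""
    return n_params * bits / 8 / 1e9


def pick_bit_width(n_params, gpu_memory_gb,
                   weight_budget_fraction=0.7, candidates=(16, 8, 4)):
    """Return the widest bit-width whose weights fit the budget, else None.

    weight_budget_fraction is the assumed share of GPU memory available
    for weights; the rest is left for KV cache and activations.
    """
    budget_gb = gpu_memory_gb * weight_budget_fraction
    for bits in sorted(candidates, reverse=True):
        if weight_gb(n_params, bits) <= budget_gb:
            return bits
    return None  # does not fit on one GPU even at the narrowest width


if __name__ == "__main__":
    for n_params, gpu_gb in [(7e9, 24), (13e9, 24), (70e9, 80), (70e9, 24)]:
        choice = pick_bit_width(n_params, gpu_gb)
        print(f"{n_params / 1e9:.0f}B params on {gpu_gb} GB GPU -> {choice}")
```

A None result signals that the model needs tensor parallelism across several GPUs or a smaller model, rather than a narrower format.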

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
