Model Quantization: INT8, INT4, and AWQ/GPTQ – Choosing the Right Compression for Production

This article explains how INT8, INT4, bitsandbytes, GPTQ, and AWQ quantization methods can dramatically cut memory usage, boost inference speed, and lower costs for large language models, while detailing their trade‑offs, practical workflows, benchmark results, and common pitfalls to help engineers decide which technique best fits their production scenario.

MaGe Linux Operations
MaGe Linux Operations
MaGe Linux Operations
Model Quantization: INT8, INT4, and AWQ/GPTQ – Choosing the Right Compression for Production

Why Quantization Matters

Deploying a 70B model on four 40 GB A100 GPUs quickly runs into a memory wall (180 GB+ peak), a bandwidth wall (70 ms start‑up for moving 140 GB of weights), a cost wall (≈¥100 k per month for 8 GPUs) and a power wall (3.2 kW for a full rack). Quantization trades a small loss in precision for large gains in memory, speed, and cost, turning “cannot fit / cannot run” models into deployable services.

Core Quantization Concepts

Quantization reduces the bit‑width of weights and activations from FP16 to INT8 or INT4. The basic formula is q = round((w - zero_point) / scale), where scale and zero_point are determined per granularity (per‑tensor, per‑channel, or per‑group). Per‑group (e.g., 32/64/128 weights per scale) offers the best accuracy‑speed balance.

Two main quantization timings exist:

Post‑Training Quantization (PTQ) – quantize a frozen FP16 model using a small calibration set (100‑500 samples).

Quantization‑Aware Training (QAT) – not covered here; used mainly in research.

Quantization Schemes Compared

INT8 (bitsandbytes) : almost no accuracy loss, 2× speed, memory halved. Recommended when you have enough memory to keep the model in FP16 but want a cheap speed boost.

INT4 (bitsandbytes NF4) : 4× memory reduction, 3‑4× speed, modest PPL increase (≈0.1‑0.2). Works well for 7‑13 B models on a single GPU.

GPTQ 4‑bit : uses second‑order (Hessian) information to minimise per‑layer error. Slightly better accuracy than plain INT4, but slower to quantize and requires careful group‑size tuning.

AWQ 4‑bit : protects the most important weight channels, giving the best overall trade‑off – higher accuracy than GPTQ, similar speed, and the smallest memory footprint among 4‑bit methods.

GGUF (llama.cpp) : CPU‑only format with several quantization levels (Q4_K_M, Q5_K_M, Q6_K, etc.). Ideal for edge devices or when GPU memory is unavailable.

Practical PTQ Workflow

Gather a calibration dataset that matches the target domain (e.g., Chinese Wikipedia for Chinese chat models).

Run a forward pass on 100‑500 samples to collect activation statistics.

Compute scale and zero_point per the chosen granularity.

Quantize the weights (and optionally activations) to INT8/INT4.

Save the quantized checkpoint.

Typical pitfalls include using the wrong calibration data (different language or domain) and selecting an inappropriate group_size (128 is a good default; 32 is fast but may hurt accuracy; 256 saves memory).

Code Snippets

# inference_8bit.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_8bit = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-7B-Instruct",
    quantization_config=bnb_8bit,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")

inputs = tokenizer("北京是中国的", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
# inference_4bit.py
bnb_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # NormalFloat 4‑bit, 11 % more accurate than FP4
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_storage=torch.uint8,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-7B-Instruct",
    quantization_config=bnb_4bit,
    device_map="auto",
)
# quantize_gptq.py (simplified)
from datasets import load_dataset
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

calib_data = [d["text"] for d in load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:512]") if len(d["text"]) > 200]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False, damp_percent=0.01)
model = AutoGPTQForCausalLM.from_pretrained("Qwen/Qwen2-7B-Instruct", quantize_config, trust_remote_code=True)
model.quantize(calib_data, use_triton=True, batch_size=4)
model.save_quantized("./qwen2-7b-gptq-4bit")
# quantize_awq.py (simplified)
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_pretrained("Qwen/Qwen2-7B-Instruct", safetensors=True)
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
calib_data = [d["text"].strip() for d in load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:128]") if len(d["text"]) > 200]
model.quantize(AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct"), quant_config=quant_config, calib_data=calib_data)
model.save_quantized("./qwen2-7b-awq-4bit")

Performance Highlights (Qwen2‑7B on a single A100 80 GB)

FP16 baseline: 14 GB memory, 38 tokens/s, PPL 4.2.

INT8 (bitsandbytes): 7.8 GB, 28 tokens/s, PPL 4.21.

INT4 NF4 (bitsandbytes): 4.5 GB, 22 tokens/s, PPL 4.35.

GPTQ‑4bit (group 128): 4.8 GB, 56 tokens/s, PPL 4.42.

AWQ‑4bit (group 128): 4.7 GB, 62 tokens/s, PPL 4.32 – the best overall trade‑off.

Evaluation Checklist

After quantization, always verify both accuracy and business metrics**:

PPL – use transformers to compute; acceptable increase < 0.3 (≈5 %).

MMLU / C‑Eval / HumanEval – run with lm‑eval‑harness; tolerance < 2 %.

Business KPIs – A/B test with real queries; adoption rate must not drop > 1 %.

Typical failure modes include calibration data mismatch, outlier‑heavy weight distributions, and sampling‑parameter sensitivity.

Common Pitfalls and Mitigations (12 Tips)

Chat models need chat‑formatted calibration data; otherwise AWQ PPL can jump > 1.

Group size 128 balances speed and accuracy; use 64 for higher precision, 256 when memory is tight.

Match language of calibration data to model language; never use test set for calibration.

New architectures (e.g., MoE) may not be supported by AWQ – fall back to GPTQ.

bitsandbytes 4‑bit + multi‑GPU training may require specific bitsandbytes, accelerate, and transformers versions.

vLLM expects the Marlin kernel for AWQ; set "version": "GEMM" and upgrade vLLM to 0.4+.

Global PPL can hide long‑tail degradation; evaluate a curated set of high‑frequency business queries.

Sampling temperature becomes more sensitive after quantization – re‑tune or lower it.

NaN outputs in vLLM often stem from kernel mismatches – upgrade vLLM or disable CUDA graphs.

Quantization formats are not interchangeable; use the matching inference engine (vLLM for GPTQ/AWQ, llama.cpp for GGUF).

Multi‑turn dialogue suffers from KV‑cache precision loss – consider 8‑bit or enable prefix caching.

INT4 hurts very small models (< 3 B); prefer INT8 or keep FP16 for them.

Optimization Roadmap

Start with AWQ 4‑bit + vLLM – best for 90 % of production cases.

If precision is critical, use INT8 + continuous batching before dropping to INT4.

On H100 GPUs, try FP8 (E4M3) for near‑FP16 quality.

Enable KV‑cache quantization (vLLM 0.4+) for long‑context workloads.

Consider speculative decoding (draft + verify) to double throughput.

Leverage tensor parallelism when a single GPU cannot hold the model.

Use flash/paged attention and operator fusion – they are essential for speed.

For extreme latency needs, explore ahead‑of‑time compilation (TensorRT‑LLM, MLC‑LLM).

Decision Tree for Production

If GPU memory ≥ 2× model size → run FP16 baseline first.

If memory ≈ model size → apply INT8 (bitsandbytes) – negligible loss.

If memory < model size → use AWQ 4‑bit (default) or GPTQ 4‑bit.

If memory ≈ model size / 4 → combine AWQ 4‑bit with tensor parallelism.

For CPU‑only inference → convert to GGUF and pick Q4_K_M (best quality‑size trade‑off).

On H100 → evaluate FP8 (E4M3) for the highest accuracy‑speed ratio.

When to Switch to Other Techniques

Quantization loss exceeds business tolerance → move to INT8 or keep FP16.

Model still does not fit a single GPU after quantization → adopt tensor parallelism.

Training cost is prohibitive → use spot instances, DeepSpeed, and QLoRA.

Final Takeaway

Quantization is the most practical way to make large language models deployable: INT8 for low‑risk speed‑up, AWQ 4‑bit for the best memory‑speed‑accuracy balance, and FP8 on H100 for cutting‑edge hardware. Always keep a full‑precision backup and validate with real‑world queries before rolling out.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

GPTQINT8Model Quantizationperformance benchmarkingLLM deploymentINT4AWQ
MaGe Linux Operations
Written by

MaGe Linux Operations

Founded in 2009, MaGe Education is a top Chinese high‑end IT training brand. Its graduates earn 12K+ RMB salaries, and the school has trained tens of thousands of students. It offers high‑pay courses in Linux cloud operations, Python full‑stack, automation, data analysis, AI, and Go high‑concurrency architecture. Thanks to quality courses and a solid reputation, it has talent partnerships with numerous internet firms.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.