Understanding LLM Quantization: GPTQ, QAT, AWQ, GGUF, and GGML Explained
This article walks through the fundamentals of large-language-model quantization: a concrete int8 example, detailed explanations of the GPTQ, GGUF/GGML, QAT, and AWQ methods, and step-by-step code snippets, formulas, calibration procedures, and performance observations for each technique.
Basic Int8 Quantization Example
Old range = max(FP16 weight) – min(FP16 weight) = 0.932 – 0.0609 = 0.871
New range for int8 = 127 – (‑128) = 255
Scale = 127 / max(|FP16 weight|) = 127 / 0.932 = 136.2472498690413 (the extra digits come from dividing by the exact FP16 representation of 0.932, which is 0.93212890625)
Quantized value formula:
Quantized Value = Round(Scale × Original Value)
De-quantization formula:
New Value = Quantized Value / Scale
Converting back to FP16 introduces small differences (e.g., 0.5415 becomes 0.543), illustrating quantization-dequantization error.
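To make the round trip concrete, here is a minimal NumPy sketch of the abs-max int8 quantize/de-quantize step using the weight values above; the helper names are illustrative, not from any library.
import numpy as np

def quantize_int8(weights):
    # Abs-max scale: map the largest-magnitude weight to 127
    scale = 127.0 / np.max(np.abs(weights))
    q = np.clip(np.round(scale * weights), -128, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    # Recover approximate FP16 values; a small rounding error remains
    return (q.astype(np.float32) / scale).astype(np.float16)

weights = np.array([0.0609, 0.5415, 0.932], dtype=np.float16)
q, scale = quantize_int8(weights)
print(q)                          # approximately [  8  74 127]
print(dequantize_int8(q, scale))  # 0.5415 comes back as roughly 0.543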
LLM Quantization Types
GPTQ (Post-Training Quantization for Generative Pre-trained Transformers)
GPTQ is a post‑training quantization method that works best on GPUs. Variants include static‑range GPTQ (weights + activations), dynamic‑range GPTQ (weights + runtime activation quantization), and weight‑only quantization.
Static‑Range GPTQ Process
Calibration dataset: sample (e.g., 1 000 examples) from the original pre‑training data.
Run inference on the calibration set to collect activation and weight ranges (e.g., activation range 0.2‑0.9, weight range 0.1‑0.3).
Apply the same mathematical conversion described in the int8 example to each layer.
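For example, the activation range 0.2-0.9 collected during calibration can be turned into an int8 scale and zero-point with an asymmetric mapping, as in this minimal sketch (the int8 example above uses the simpler abs-max variant; the function names here are illustrative):
import numpy as np

def calibration_params(obs_min, obs_max, num_bits=8):
    # Asymmetric mapping of the observed range onto [-128, 127]
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    scale = (obs_max - obs_min) / (qmax - qmin)
    zero_point = int(round(qmin - obs_min / scale))
    return scale, zero_point

def quantize(x, scale, zero_point):
    return np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)

# Activation range 0.2-0.9 observed on the calibration set
scale, zp = calibration_params(0.2, 0.9)
acts = np.array([0.2, 0.55, 0.9], dtype=np.float32)
print(quantize(acts, scale, zp))  # roughly [-128, -1, 127]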
Algorithm steps:
Weight matrix grouping – split each layer’s weight matrix into column groups (e.g., group_size = 128).
Iterative quantization – quantize one column in a group, then adjust remaining columns to compensate for error.
Global error compensation – after processing a group, update other groups (lazy batch update).
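The toy sketch below conveys the column-by-column idea; it is illustrative only (the real GPTQ algorithm uses second-order, Hessian-based error compensation rather than the simple even spreading shown here):
import numpy as np

def quantize_column(col, bits=4):
    # Abs-max quantization of a single weight column; returns the
    # de-quantized values that would actually be stored
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.max(np.abs(col))
    if max_abs == 0:
        return col
    scale = max_abs / qmax
    return np.clip(np.round(col / scale), -qmax - 1, qmax) * scale

def groupwise_quantize(W, group_size=128):
    # Walk each group column by column; columns not yet quantized
    # absorb part of the error left behind by the quantized ones
    W = W.copy()
    n_cols = W.shape[1]
    for start in range(0, n_cols, group_size):
        end = min(start + group_size, n_cols)
        for j in range(start, end):
            q_col = quantize_column(W[:, j])
            err = W[:, j] - q_col
            W[:, j] = q_col
            if j + 1 < end:
                W[:, j + 1:end] += err[:, None] / (end - j - 1)
    return W

W = np.random.randn(16, 256).astype(np.float32)
print(np.abs(W - groupwise_quantize(W, group_size=128)).mean())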
Code example (quantizing bigscience/bloom-3b to 4 bits with auto_gptq on a Google Colab T4 GPU):
!pip install auto_gptq
import torch
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import TextGenerationPipeline, AutoTokenizer
pretrained_model_name = "bigscience/bloom-3b"
quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
model = AutoGPTQForCausalLM.from_pretrained(
    pretrained_model_name,
    quantize_config,
    trust_remote_code=False,
    device_map="auto",
    torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name)
examples = [tokenizer("Automated machine learning is the process of automating the tasks of applying machine learning to real-world problems. AutoML potentially includes every stage from beginning with a raw dataset to building a machine learning model ready for deployment.")]
model.quantize(examples)
quantized_model_dir = "bloom3b_q4b_gs128"
model.save_quantized(quantized_model_dir)
# Inference with the quantized model
device = "cuda:0"
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device=device, torch_dtype=torch.float16)
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer, max_new_tokens=50)
print(pipeline("Automated machine learning is")[0]["generated_text"])
Running the inference code increases GPU memory usage, confirming that the quantized model runs on the GPU.
GGUF / GGML
GGUF is the next-generation file format of GGML, a C/C++ tensor library for LLM inference that supports models such as LLaMA and Falcon. GGUF models run on both Windows and Linux CPUs and offer quantization levels from 2 to 8 bits.
Conversion steps:
# Install llama.cpp
!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && git pull && make clean && LLAMA_CUBLAS=1 make
!pip install -r llama.cpp/requirements.txt
# Download model
!git lfs install
!git clone https://huggingface.co/Siddharthvij10/MistralSharded2
# Convert weights to fp16
!python llama.cpp/convert.py MistralSharded2 --outtype f16 --outfile "MistralSharded2/mistralsharded2.fp16.bin"
# Quantize (requires GPU RAM but uses little of it)
!./llama.cpp/quantize "MistralSharded2/mistralsharded2.fp16.bin" "MistralSharded2/mistralsharded2.Q4_K_M.gguf" q4_k_m
Inference with the GGUF model:
./llama.cpp/main -m "MistralSharded2/mistralsharded2.Q4_K_M.gguf" -n 35 --color -ngl 32 -p "Automated machine learning"
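The same GGUF file can also be loaded from Python via the llama-cpp-python bindings; this is a minimal sketch assuming pip install llama-cpp-python, with the path matching the file produced above:
from llama_cpp import Llama

# Load the 4-bit GGUF file; n_gpu_layers offloads that many layers to the GPU (0 = CPU only)
llm = Llama(model_path="MistralSharded2/mistralsharded2.Q4_K_M.gguf", n_gpu_layers=32, n_ctx=2048)
output = llm("Automated machine learning", max_tokens=35)
print(output["choices"][0]["text"])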
Quantization-Aware Training (QAT)
QAT starts from a pre-trained or PTQ model and fine-tunes it to recover the accuracy lost during quantization. Only layers that tolerate low precision are quantized; the others remain in full precision. The core idea is pseudo-quantization node insertion: inputs are quantized, the multiplication is performed, and the output is de-quantized back to high precision.
During forward propagation QAT introduces quantization error, which accumulates and is corrected by the optimizer in back‑propagation.
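As a rough numerical illustration of such a pseudo-quantization node (a hand-rolled sketch, not the TensorFlow Model Optimization implementation used in the example below):
import numpy as np

def fake_quant(x, bits=8):
    # Quantize and immediately de-quantize, so downstream computation
    # sees values carrying realistic quantization error
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    return np.round(x / scale) * scale

x = np.random.randn(4, 8).astype(np.float32)   # activations
w = np.random.randn(8, 2).astype(np.float32)   # weights
y_fp = x @ w                                   # full-precision reference
y_qat = fake_quant(x) @ fake_quant(w)          # forward pass carrying quantization error
print(np.abs(y_fp - y_qat).mean())             # error the optimizer learns to compensate for
The end-to-end Keras example below then applies quantization-aware training with TensorFlow Model Optimization: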
!pip uninstall -y tensorflow
!pip install -q tf-nightly  # nightly build with the latest fixes
!pip install -q tensorflow-model-optimization
import numpy as np, pandas as pd
import tensorflow as tf
import tensorflow_model_optimization as tfmot
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Sample data
np.random.seed(0)
data = pd.DataFrame(np.random.rand(1000,5), columns=['Feature1','Feature2','Feature3','Feature4','Feature5'])
target = pd.Series(np.random.randint(0,2,size=1000), name='Target')
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Simple Keras model
model = tf.keras.Sequential([
tf.keras.layers.Dense(64, activation='relu', input_shape=(5,)),
tf.keras.layers.Dense(32, activation='relu'),
tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train_scaled, y_train, epochs=10, batch_size=32, validation_split=0.2)
# Quantization‑aware model
quant_aware_model = tfmot.quantization.keras.quantize_model(model)
quant_aware_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
quant_aware_model.fit(X_train_scaled, y_train, epochs=10, batch_size=32, validation_split=0.2)
AWQ (Activation-aware Weight Quantization)
AWQ selectively quantizes weights that have little impact on model performance, keeping critical weights at higher precision.
Calibration – run a few samples through the LLM to collect weight and activation distributions.
Scaling – amplify important weights while applying low‑precision quantization to non‑critical weights.
SafeTensor conversion – weights must be stored in .safetensors format. Conversion guide: https://huggingface.co/spaces/safetensors/convert
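As a rough illustration of this workflow, here is a minimal sketch using the AutoAWQ library (assuming pip install autoawq; the model id and output path are illustrative, and the library handles the calibration run internally by default):
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-v0.1"   # illustrative model id
quant_path = "mistral-7b-awq"

# 4-bit weights with group size 128; zero_point enables an asymmetric mapping
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Calibration and activation-aware scaling run inside quantize()
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)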
References
GitHub: https://github.com/siddharthvij10/BITS_Mtech/tree/main/LLM/ModelCompression