
Survey of Model Pruning and Quantization Techniques for Deep Learning

This article provides a comprehensive overview of recent advances in deep learning model compression. It covers pruning methods (unstructured, structured, filter-wise, channel-wise, shape-wise, and stripe-wise) and quantization techniques (linear, non-linear, clustering-based, power-of-two, binary, and 8-bit), and discusses evaluation criteria, sparsity ratios, fine-tuning, and quantization-aware training.


Deep learning has achieved remarkable performance in recent years, but the increasing model size and computational complexity make deployment on resource‑limited hardware challenging. Model compression and acceleration aim to reduce network parameters and computational cost while preserving accuracy.

Compression techniques include network architecture optimization (e.g., using smaller convolution kernels, depth‑wise convolutions), pruning (removing redundant weights, neurons, or layers), quantization (reducing precision of weights and activations), and hardware‑level optimizations (leveraging GPUs, FPGAs, ASICs, TPUs).

Pruning can be categorized into unstructured (weight-level sparsity) and structured (filter-wise, channel-wise, shape-wise, or stripe-wise) approaches. Unstructured pruning produces sparse weight matrices, which require specialized hardware or kernels to realize a speedup. Structured pruning removes entire filters or channels, preserving dense computation patterns. Various algorithms, including greedy selection, L1 regularization, group lasso, and specialized modules such as Filter Skeleton, estimate parameter importance and guide the pruning steps.
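The distinction between the two families can be sketched in a few lines of NumPy. The snippet below is an illustrative sketch, not any specific paper's method: unstructured pruning zeroes individual low-magnitude weights (the tensor shape is unchanged, only sparsity increases), while structured filter-wise pruning ranks whole filters by L1 norm and drops the weakest ones, yielding a smaller dense tensor.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical conv weight tensor: (out_channels, in_channels, kH, kW)
W = rng.normal(size=(8, 4, 3, 3))

# Unstructured pruning: zero the smallest-magnitude weights globally.
sparsity = 0.5
threshold = np.quantile(np.abs(W), sparsity)
W_unstructured = np.where(np.abs(W) <= threshold, 0.0, W)

# Structured (filter-wise) pruning: rank filters by L1 norm, drop whole filters.
l1_per_filter = np.abs(W).reshape(W.shape[0], -1).sum(axis=1)
keep = np.sort(np.argsort(l1_per_filter)[W.shape[0] // 2:])  # keep top half
W_structured = W[keep]

print(W_unstructured.shape)  # same shape as W, but half the entries are zero
print(W_structured.shape)    # (4, 4, 3, 3): a smaller dense tensor
```

The practical consequence shown here is exactly the trade-off described above: the sparse tensor needs sparse-matrix support to run faster, whereas the filter-pruned tensor is an ordinary (smaller) convolution that any hardware accelerates directly.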

The pruning workflow typically involves: (1) training a high‑performance model, (2) evaluating parameter importance, (3) removing low‑importance parameters, (4) fine‑tuning on the training set, and (5) iterating until desired size, speed, and accuracy are achieved.
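The five steps above can be expressed as a simple loop. The skeleton below is a hedged sketch: `train`, `evaluate`, and `prune_step` are placeholder callables standing in for whatever training loop and importance criterion a real pipeline uses.

```python
# Hypothetical skeleton of the iterative prune-and-fine-tune workflow.
def iterative_pruning(model, train, evaluate, prune_step,
                      target_sparsity, step=0.1):
    """Prune gradually, fine-tuning after each step.

    All callables are placeholders for a real training pipeline.
    """
    train(model)                        # (1) train a high-performance model
    sparsity = 0.0
    while sparsity < target_sparsity:
        sparsity = min(sparsity + step, target_sparsity)
        prune_step(model, sparsity)     # (2)-(3) score and remove parameters
        train(model)                    # (4) fine-tune on the training set
    return evaluate(model)              # (5) check size/speed/accuracy
```

Pruning in small increments with fine-tuning between steps, rather than in one shot, is what allows the network to recover accuracy at each stage.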

Quantization maps high‑precision floating‑point values to lower‑bit fixed‑point representations. Linear quantization (symmetric or asymmetric) uses scale and zero‑point parameters, while non‑linear quantization employs clustering or custom mapping functions. Common schemes include power‑of‑two quantization, binary (1‑bit) quantization, and 8‑bit quantization, each balancing model size, inference speed, and accuracy loss.
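Asymmetric linear quantization, the scale-and-zero-point scheme mentioned above, can be written out directly. This is a minimal sketch assuming 8-bit unsigned integers and per-tensor min/max calibration; production toolchains add per-channel scales and range-clipping heuristics.

```python
import numpy as np

def quantize_asymmetric(x, num_bits=8):
    """Map float values to unsigned fixed-point via a scale and zero-point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover an approximation of the original floats."""
    return scale * (q.astype(np.float32) - zero_point)

x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0], dtype=np.float32)
q, s, z = quantize_asymmetric(x)
x_hat = dequantize(q, s, z)
```

The reconstruction error is bounded by the scale (the step size between adjacent quantization levels), which is why wider calibration ranges or fewer bits cost more accuracy. Symmetric quantization is the special case where the zero-point is fixed so that float zero maps exactly to an integer level.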

Quantization can be performed post‑training (PTQ) or with quantization‑aware training (QAT). PTQ uses calibration data to estimate activation ranges, whereas QAT inserts fake‑quantization ops during training, allowing gradients to flow through a straight‑through estimator (STE) and typically preserving accuracy better than PTQ at low bit widths.
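The core of a fake-quantization op with an STE backward pass fits in a few lines. The sketch below uses plain NumPy to show the idea rather than any particular framework's API: the forward pass quantizes and immediately dequantizes, and the backward pass passes the upstream gradient through unchanged wherever the input fell inside the representable range, zeroing it where the value was clipped.

```python
import numpy as np

def fake_quantize(x, scale, zero_point=0, qmin=-128, qmax=127):
    """Forward pass: quantize, then immediately dequantize.

    The network trains against the rounding error it will see at inference.
    """
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return scale * (q - zero_point)

def fake_quantize_grad(x, upstream_grad, scale, zero_point=0,
                       qmin=-128, qmax=127):
    """Backward pass via the straight-through estimator (STE).

    Rounding has zero gradient almost everywhere, so the STE treats it as
    the identity inside the quantization range and blocks gradient outside.
    """
    q_unclipped = np.round(x / scale) + zero_point
    inside = (q_unclipped >= qmin) & (q_unclipped <= qmax)
    return upstream_grad * inside

x = np.array([-20.0, -0.3, 0.0, 0.7, 20.0])
y = fake_quantize(x, scale=0.1)
g = fake_quantize_grad(x, np.ones_like(x), scale=0.1)
```

Here the two extreme inputs saturate at the int8 limits, so their dequantized values clamp to ±12.8 and their gradients are zeroed, while the in-range values round to the nearest 0.1 step and receive the full gradient.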

The article concludes with practical recommendations: start with symmetric per‑channel weight quantization, apply fine‑tuning if accuracy degrades, use QAT for higher precision retention, and consider hardware constraints when choosing pruning and quantization strategies.

Deep Learning · Model Compression · Quantization · Neural Networks · Pruning
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
