
Survey of Model Pruning and Quantization Techniques for Deep Learning

This article provides a comprehensive overview of recent advances in deep learning model compression. It covers pruning methods (unstructured, structured, filter-wise, channel-wise, shape-wise, and stripe-wise) and quantization techniques (linear, non-linear, clustering-based, power-of-two, binary, and 8-bit), and discusses evaluation criteria, sparsity ratios, fine-tuning, and quantization-aware training.


Deep learning has achieved remarkable performance in recent years, but the increasing model size and computational complexity make deployment on resource‑limited hardware challenging. Model compression and acceleration aim to reduce network parameters and computational cost while preserving accuracy.

Compression techniques include network architecture optimization (e.g., using smaller convolution kernels, depth‑wise convolutions), pruning (removing redundant weights, neurons, or layers), quantization (reducing precision of weights and activations), and hardware‑level optimizations (leveraging GPUs, FPGAs, ASICs, TPUs).

Pruning can be categorized into unstructured (weight-level sparsity) and structured (filter-wise, channel-wise, shape-wise, or stripe-wise) approaches. Unstructured pruning produces sparse weight matrices, which require specialized hardware or kernels to realize a speedup. Structured pruning removes entire filters or channels, preserving dense computation patterns. Various algorithms, including greedy selection, L1 regularization, group lasso, and specialized modules such as Filter Skeleton, estimate parameter importance and guide the pruning steps.
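The distinction between the two families can be sketched in a few lines of NumPy. The snippet below is an illustrative sketch, not any specific paper's method: unstructured pruning zeroes individual low-magnitude weights (the tensor shape is unchanged, only sparsity increases), while structured filter-wise pruning ranks whole filters by L1 norm and drops the weakest ones, yielding a smaller dense tensor.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical conv weight tensor: (out_channels, in_channels, kH, kW)
W = rng.normal(size=(8, 4, 3, 3))

# Unstructured pruning: zero the smallest-magnitude weights globally.
sparsity = 0.5
threshold = np.quantile(np.abs(W), sparsity)
W_unstructured = np.where(np.abs(W) <= threshold, 0.0, W)

# Structured (filter-wise) pruning: rank filters by L1 norm, drop whole filters.
l1_per_filter = np.abs(W).reshape(W.shape[0], -1).sum(axis=1)
keep = np.sort(np.argsort(l1_per_filter)[W.shape[0] // 2:])  # keep top half
W_structured = W[keep]

print(W_unstructured.shape)  # same shape as W, but half the entries are zero
print(W_structured.shape)    # (4, 4, 3, 3): a smaller dense tensor
```

The practical consequence shown here is exactly the trade-off described above: the sparse tensor needs sparse-matrix support to run faster, whereas the filter-pruned tensor is an ordinary (smaller) convolution that any hardware accelerates directly.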

The pruning workflow typically involves: (1) training a high‑performance model, (2) evaluating parameter importance, (3) removing low‑importance parameters, (4) fine‑tuning on the training set, and (5) iterating until desired size, speed, and accuracy are achieved.
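The five steps above can be expressed as a simple loop. The skeleton below is a hedged sketch: `train`, `evaluate`, and `prune_step` are placeholder callables standing in for whatever training loop and importance criterion a real pipeline uses.

```python
# Hypothetical skeleton of the iterative prune-and-fine-tune workflow.
def iterative_pruning(model, train, evaluate, prune_step,
                      target_sparsity, step=0.1):
    """Prune gradually, fine-tuning after each step.

    All callables are placeholders for a real training pipeline.
    """
    train(model)                        # (1) train a high-performance model
    sparsity = 0.0
    while sparsity < target_sparsity:
        sparsity = min(sparsity + step, target_sparsity)
        prune_step(model, sparsity)     # (2)-(3) score and remove parameters
        train(model)                    # (4) fine-tune on the training set
    return evaluate(model)              # (5) check size/speed/accuracy
```

Pruning in small increments with fine-tuning between steps, rather than in one shot, is what allows the network to recover accuracy at each stage.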

Quantization maps high‑precision floating‑point values to lower‑bit fixed‑point representations. Linear quantization (symmetric or asymmetric) uses scale and zero‑point parameters, while non‑linear quantization employs clustering or custom mapping functions. Common schemes include power‑of‑two quantization, binary (1‑bit) quantization, and 8‑bit quantization, each balancing model size, inference speed, and accuracy loss.
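Asymmetric linear quantization, the scale-and-zero-point scheme mentioned above, can be written out directly. This is a minimal sketch assuming 8-bit unsigned integers and per-tensor min/max calibration; production toolchains add per-channel scales and range-clipping heuristics.

```python
import numpy as np

def quantize_asymmetric(x, num_bits=8):
    """Map float values to unsigned fixed-point via a scale and zero-point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover an approximation of the original floats."""
    return scale * (q.astype(np.float32) - zero_point)

x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0], dtype=np.float32)
q, s, z = quantize_asymmetric(x)
x_hat = dequantize(q, s, z)
```

The reconstruction error is bounded by the scale (the step size between adjacent quantization levels), which is why wider calibration ranges or fewer bits cost more accuracy. Symmetric quantization is the special case where the zero-point is fixed so that float zero maps exactly to an integer level.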

Quantization can be performed post‑training (PTQ) or with quantization‑aware training (QAT). PTQ uses calibration data to estimate activation ranges, whereas QAT inserts fake‑quantization ops during training, allowing gradients to flow through a straight‑through estimator (STE) and typically preserving accuracy better than PTQ at low bit widths.
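The core of a fake-quantization op with an STE backward pass fits in a few lines. The sketch below uses plain NumPy to show the idea rather than any particular framework's API: the forward pass quantizes and immediately dequantizes, and the backward pass passes the upstream gradient through unchanged wherever the input fell inside the representable range, zeroing it where the value was clipped.

```python
import numpy as np

def fake_quantize(x, scale, zero_point=0, qmin=-128, qmax=127):
    """Forward pass: quantize, then immediately dequantize.

    The network trains against the rounding error it will see at inference.
    """
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return scale * (q - zero_point)

def fake_quantize_grad(x, upstream_grad, scale, zero_point=0,
                       qmin=-128, qmax=127):
    """Backward pass via the straight-through estimator (STE).

    Rounding has zero gradient almost everywhere, so the STE treats it as
    the identity inside the quantization range and blocks gradient outside.
    """
    q_unclipped = np.round(x / scale) + zero_point
    inside = (q_unclipped >= qmin) & (q_unclipped <= qmax)
    return upstream_grad * inside

x = np.array([-20.0, -0.3, 0.0, 0.7, 20.0])
y = fake_quantize(x, scale=0.1)
g = fake_quantize_grad(x, np.ones_like(x), scale=0.1)
```

Here the two extreme inputs saturate at the int8 limits, so their dequantized values clamp to ±12.8 and their gradients are zeroed, while the in-range values round to the nearest 0.1 step and receive the full gradient.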

The article concludes with practical recommendations: start with symmetric per‑channel weight quantization, apply fine‑tuning if accuracy degrades, use QAT for higher precision retention, and consider hardware constraints when choosing pruning and quantization strategies.

Deep Learning · Model Compression · Quantization · Neural Networks · Pruning
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
