
Survey of Model Compression and Quantization Techniques for Deep Neural Networks

This article provides a comprehensive overview of deep learning model compression and acceleration, covering pruning strategies and types, evaluation criteria, sparsity ratios, and fine‑tuning procedures, as well as linear and non‑linear quantization approaches, their implementations, and practical considerations.

Laiye Technology Team

1. Overview

Recent advances in deep learning have led to models with ever‑increasing parameters and computational complexity, making deployment on resource‑constrained hardware challenging. Model compression and acceleration aim to reduce parameter count and computational cost while preserving performance.

Compression focuses on decreasing the number of parameters, whereas acceleration targets lowering computational complexity.

Techniques include architectural redesign (e.g., using smaller 3×3 kernels, replacing fully connected layers with average pooling, employing depth‑wise separable convolutions as in MobileNets), as well as pruning, quantization, and knowledge distillation.

Hardware‑level optimizations rely on inference frameworks such as TensorRT, TensorFlow Lite, NCNN, and MNN, and on specialized hardware such as GPUs, FPGAs, ASICs, TPUs, and NPUs.

2. Pruning

2.1 Pruning Process

Deep neural networks contain many redundant parameters; pruning removes unimportant weights, neurons, or layers to reduce model size and inference cost, analogous to a gardener trimming a dense plant.

The typical workflow is:

1. Train a high‑performance original model.

2. Assess the importance of each parameter.

3. Remove parameters with low importance.

4. Fine‑tune on the training set to recover accuracy.

5. Check whether size, speed, and accuracy meet requirements; repeat from step 2 if necessary.
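The steps above can be sketched as a driver loop. All names and callables here (`prune_until_target`, `train_fn`, and so on) are hypothetical placeholders for illustration, not the API of any specific framework:

```python
def prune_until_target(model, train_fn, eval_fn, prune_fn,
                       target_sparsity, step=0.1):
    """Iteratively prune and fine-tune until the target sparsity is
    reached, then evaluate (hypothetical driver; the callables stand
    in for real training, evaluation, and pruning routines)."""
    sparsity = 0.0
    while sparsity < target_sparsity:
        # Step 2-3: raise sparsity gradually and remove low-importance
        # parameters, rather than pruning everything at once.
        sparsity = min(sparsity + step, target_sparsity)
        prune_fn(model, sparsity)
        # Step 4: fine-tune to recover the accuracy lost to pruning.
        train_fn(model)
    # Step 5: report final accuracy so the caller can decide
    # whether another round (or a different ratio) is needed.
    return eval_fn(model)
```

Pruning in small increments with fine‑tuning in between typically preserves more accuracy than removing the full budget in one shot.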

2.2 Pruning Types

Pruning can be categorized by the basic operation unit:

Unstructured pruning: removes individual weight elements, resulting in sparse matrices.

Structured pruning: removes whole filters or channels, preserving dense matrix structures and enabling efficient execution on existing hardware.

2.2.1 Unstructured Pruning

Weights with the smallest absolute values are set to zero based on a global ranking.
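As a minimal sketch of global magnitude pruning (using NumPy for illustration; the function name and interface are our own):

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest
    absolute values, based on a global ranking. Returns the pruned
    weights and the boolean mask of surviving entries."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy(), np.ones_like(weights, dtype=bool)
    # The k-th smallest magnitude becomes the pruning threshold.
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask, mask
```

The resulting sparse matrix shrinks storage, but speedups require sparse-aware kernels or hardware.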

2.2.2 Structured Pruning – Filter‑wise

Entire convolutional kernels (filters) are removed, which also reduces the corresponding feature‑map channels in the next layer.
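A common importance score for filter-wise pruning is the L1 norm of each filter, as in "Pruning Filters for Efficient ConvNets". A sketch (NumPy layout and names are our own assumptions):

```python
import numpy as np

def prune_filters(conv_weight, keep_ratio):
    """Keep the filters with the largest L1 norms.
    conv_weight: array of shape (out_channels, in_channels, kH, kW).
    Returns the kept weight tensor and the surviving filter indices."""
    # Sum of absolute values per output filter.
    norms = np.abs(conv_weight).reshape(conv_weight.shape[0], -1).sum(axis=1)
    n_keep = max(1, int(round(keep_ratio * conv_weight.shape[0])))
    # Indices of the n_keep largest norms, in original order.
    keep = np.sort(np.argsort(norms)[-n_keep:])
    return conv_weight[keep], keep
```

Because whole filters disappear, the next layer's input channels shrink accordingly and no sparse kernels are needed.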

2.2.3 Structured Pruning – Channel‑wise

Channels are pruned by leveraging batch‑norm scaling factors; channels with small scaling factors are considered less important.
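A Network-Slimming-style selection can be sketched as follows (NumPy for illustration; the helper name is our own):

```python
import numpy as np

def select_channels_by_bn(gammas, prune_ratio):
    """Keep channels whose batch-norm scaling factor |gamma| exceeds a
    global percentile threshold; channels below it are pruned."""
    threshold = np.percentile(np.abs(gammas), prune_ratio * 100)
    return np.abs(gammas) > threshold
```

In Network Slimming the gammas are first driven toward zero by an L1 penalty during training, so the threshold cleanly separates important channels from negligible ones.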

2.2.4 Structured Pruning – Shape‑wise

Pruning granularity is finer, targeting specific positions within each kernel.

2.2.5 Structured Pruning – Stripe‑wise (SWP)

Stripes (1×1×C slices) are carved out of each 3×3×C kernel and pruned based on importance scores learned by the Filter Skeleton (FS) module.

2.3 Pruning Evaluation Criteria

Commonly a greedy approach ranks importance scores (e.g., weight magnitude, sum of absolute values) and removes a proportion of parameters. Regularization techniques such as L1 or group‑lasso are often added to encourage sparsity.
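The L1 regularizer mentioned above adds a penalty λ·Σ|w| to the loss; its subgradient is simply λ·sign(w), which nudges every weight toward zero during training. A one-line sketch (NumPy; λ value is illustrative):

```python
import numpy as np

def l1_penalty_grad(weights, lam=1e-4):
    """Subgradient of lam * sum(|w|), added to the task gradient to
    encourage sparsity (lam is an illustrative hyperparameter)."""
    return lam * np.sign(weights)
```

Group lasso works the same way but penalizes the norm of whole groups (e.g., filters), so entire groups are driven to zero together.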

2.4 Sparsity Ratio / Pruning Rate

Sparsity can be predefined globally or locally per layer, or adaptively determined during training.

2.5 Fine‑Tuning

Since pruning alters the network structure, fine‑tuning is required to recover lost accuracy, often alternating pruning and fine‑tuning steps.

3. Quantization

3.1 Basic Principles

Quantization maps high‑precision floating‑point values to lower‑bit fixed‑point representations. Linear quantization (most common in industry) uses a scale (S) and zero‑point (Z) to convert between float and integer domains.

3.1.1 Linear Quantization

Formulas: Q = round(R / S) + Z and R = S·(Q − Z). The scale S = (R_max − R_min) / (Q_max − Q_min) is derived from the min/max of the floating‑point tensor and the target integer range; the zero‑point Z is the integer to which the real value 0 maps.
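These formulas translate directly into code. A minimal asymmetric uint8 sketch (NumPy; function names are our own, not a framework API):

```python
import numpy as np

def linear_quant_params(r_min, r_max, q_min=0, q_max=255):
    """Derive scale S and zero-point Z from the float range."""
    # The representable range must include 0 so that Z is valid.
    r_min, r_max = min(r_min, 0.0), max(r_max, 0.0)
    scale = (r_max - r_min) / (q_max - q_min)
    zero_point = int(round(q_min - r_min / scale))
    return scale, zero_point

def quantize(r, scale, zero_point, q_min=0, q_max=255):
    # Q = round(R / S) + Z, clamped to the integer range.
    return np.clip(np.round(r / scale) + zero_point,
                   q_min, q_max).astype(np.uint8)

def dequantize(q, scale, zero_point):
    # R = S * (Q - Z): recovers the float value up to rounding error.
    return scale * (q.astype(np.float32) - zero_point)
```

Round-tripping a tensor through quantize/dequantize introduces at most about half a quantization step of error per element.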

3.1.2 Non‑Linear Quantization

Non‑linear mappings allocate more quantization levels to important weight ranges, often using clustering (e.g., K‑means) or piecewise functions.

3.2 Quantization Methods

3.2.1 Clustering Quantization

Weights are clustered into k centroids (e.g., -1, 0, 1, 2) and each weight is replaced by its nearest centroid.
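A minimal sketch of Deep-Compression-style weight sharing, using a tiny 1-D k-means with deterministic linspace initialization (an assumption for reproducibility; real implementations use more careful initialization):

```python
import numpy as np

def kmeans_quantize(weights, k=4, iters=20):
    """Cluster weights into k shared centroids and replace each weight
    by its nearest centroid (minimal 1-D k-means sketch)."""
    w = weights.ravel()
    centroids = np.linspace(w.min(), w.max(), k)
    for _ in range(iters):
        # Assign every weight to its nearest centroid.
        assign = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        # Move each centroid to the mean of its assigned weights.
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = w[assign == j].mean()
    return centroids[assign].reshape(weights.shape), centroids
```

After clustering, only the k centroid values and per-weight cluster indices (log2 k bits each) need to be stored.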

3.2.2 Power‑of‑Two Quantization

Weights are rounded to the nearest power‑of‑two, enabling shift‑based multiplication.
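A sketch of the idea (NumPy; rounding is done in the log domain, which is the usual convention):

```python
import numpy as np

def power_of_two_quantize(w):
    """Round each nonzero weight to the nearest power of two in the
    log domain, preserving sign; zeros stay zero. Multiplication by
    2**e then reduces to a bit shift on integer hardware."""
    sign = np.sign(w)
    mag = np.abs(w)
    # Guard zeros before taking log2; they are restored below.
    exp = np.round(np.log2(np.where(mag > 0, mag, 1.0)))
    return np.where(mag > 0, sign * 2.0 ** exp, 0.0)
```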

3.2.3 Binary Quantization (1‑bit)

Weights are binarized using a sign function or stochastic rounding; gradients are approximated with a straight‑through estimator.
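The deterministic variant and the straight-through backward pass can be sketched as (NumPy; the hard clip at |w| ≤ 1 follows the common BinaryConnect-style formulation):

```python
import numpy as np

def binarize(w):
    """Deterministic sign binarization: every weight becomes +1/-1."""
    return np.where(w >= 0, 1.0, -1.0)

def ste_grad(w, upstream_grad, clip=1.0):
    """Straight-through estimator: the gradient passes through the
    non-differentiable sign() unchanged, but is zeroed where |w|
    exceeds the clip range so weights cannot grow without bound."""
    return upstream_grad * (np.abs(w) <= clip)
```

During training, full-precision shadow weights are updated with these gradients while the binarized copies are used in the forward pass.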

3.2.4 8‑bit Quantization

Both symmetric (range [-128,127]) and asymmetric ([0,255]) schemes are widely supported (TensorFlow, TensorRT). Symmetric quantization may truncate outliers; asymmetric quantization uses a non‑zero zero‑point.
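A symmetric int8 sketch for contrast with the asymmetric scheme above (NumPy; here the range is restricted to [−127, 127] so the zero-point is exactly 0, as in TensorRT-style schemes):

```python
import numpy as np

def symmetric_quantize(r, num_bits=8):
    """Symmetric quantization: Z = 0 and the scale is set by the
    largest absolute value, mapping r into [-127, 127] for int8."""
    q_max = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(r)) / q_max
    q = np.clip(np.round(r / scale), -q_max, q_max).astype(np.int8)
    return q, scale
```

With Z = 0 the integer matmul needs no zero-point correction terms, which is why symmetric quantization is popular for weights.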

3.3 Post‑Training Quantization (PTQ) vs. Quantization‑Aware Training (QAT)

PTQ calibrates scale and zero‑point using a small calibration dataset, optionally applying KL‑divergence to select optimal ranges. QAT inserts fake‑quantization ops during training, using the straight‑through estimator to back‑propagate gradients through quantization.
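A fake-quantization op is just a quantize followed immediately by a dequantize, so the forward pass experiences quantization error while everything stays in float. A minimal sketch (NumPy; in QAT the backward pass through this op uses the straight-through estimator):

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    """Simulated quantization for QAT: map x to the integer grid and
    back, so downstream layers see the rounding error."""
    q_min, q_max = 0, 2 ** num_bits - 1
    x_min, x_max = min(x.min(), 0.0), max(x.max(), 0.0)
    scale = (x_max - x_min) / (q_max - q_min)
    if scale == 0:
        return x.copy()  # constant-zero tensor: nothing to quantize
    zp = round(q_min - x_min / scale)
    q = np.clip(np.round(x / scale) + zp, q_min, q_max)
    return scale * (q - zp)
```

In a real QAT setup these ops are inserted after weights and activations, and the min/max (or KL-calibrated) ranges are tracked with moving averages rather than recomputed per batch.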

3.4 Fine‑Tuning after Quantization

When quantization induces noticeable accuracy loss, fine‑tuning (or QAT) can restore performance, often achieving <5% degradation for 8‑bit models and even acceptable results for 4‑bit models.

4. Summary

Model compression and acceleration remain active research areas. Pruning and quantization provide complementary ways to obtain lightweight, high‑accuracy, fast‑inference models. Selecting appropriate techniques, sparsity levels, and fine‑tuning strategies is crucial for successful deployment.

5. References

https://jinzhuojun.blog.csdn.net/article/details/100621397

https://cs.nju.edu.cn/wujx/paper/Pruning_Survey_MLA21.pdf

https://blog.csdn.net/weixin_49457347/article/details/117110458

https://zhuanlan.zhihu.com/p/138059904

https://blog.csdn.net/wspba/article/details/75675554

http://fjdu.github.io/machine/learning/2016/07/07/quantize-neural-networks-with-tensorflow.html

https://zhuanlan.zhihu.com/p/45496826

https://zhuanlan.zhihu.com/p/361957385

https://zhuanlan.zhihu.com/p/374374300

https://blog.csdn.net/WZZ18191171661/article/details/103332338

https://zhuanlan.zhihu.com/p/58182172

https://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf

https://developer.download.nvidia.cn/video/gputechconf/gtc/2020/presentations/s21664-toward-int8-inference-deploying-quantization-aware-trained-networks-using-tensorrt.pdf

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

Learning Structured Sparsity in Deep Neural Networks

Learning Efficient Convolutional Networks through Network Slimming

Accelerating Convolutional Neural Networks by Group‑wise 2D‑filter Pruning

Pruning Filters for Efficient ConvNets

Quantizing deep convolutional networks for efficient inference: A whitepaper

Data‑Free Quantization Through Weight Equalization and Bias Correction

Pruning Filter in Filter

Author: Li Xinke
