How Can Deep Neural Networks Be Accelerated and Compressed? Key Techniques Explained
This article reviews why deep neural networks are over‑parameterized, outlines the challenges of deploying them on mobile and embedded devices, and presents six major strategies—pruning, low‑rank approximation, filter selection, quantization, knowledge distillation, and novel architecture design—to accelerate and compress models while preserving performance.
In recent years deep neural networks have become essential in vision (image classification, video analysis) and language (machine translation, speech recognition) tasks, but popular architectures such as VGG‑16 (≈138 M parameters) and ResNet‑50 (≈25 M parameters) face storage and computation bottlenecks that limit their use on mobile and embedded devices.
These networks contain a large number of redundant parameters, a phenomenon known as over‑parameterization. Compressing networks to reduce model size, inference time, and memory consumption while maintaining task performance has therefore become an active research area.
Question
From which aspects can neural networks be accelerated and compressed?
Analysis and Answer
Network compression techniques can be grouped into six main categories:
1) Network Parameter Pruning
Pruning removes neurons or connections with low importance, reducing weight count without altering the original architecture. Typical pipelines involve training a large model, applying a pruning strategy, and fine‑tuning the pruned network. Early work by Han et al. set a magnitude threshold to zero out small weights, while later dynamic pruning methods adapt importance scores during training.
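Below is a minimal sketch of magnitude‑based pruning in the spirit of Han et al.; the helper name magnitude_prune and the NumPy implementation are illustrative, not taken from the original work.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude weights.

    Illustrative magnitude-based pruning: weights below a threshold are
    set to zero, and the resulting mask would normally be held fixed
    while the surviving weights are fine-tuned.
    """
    flat = np.abs(weights).ravel()
    # Pick the threshold so that roughly `sparsity` of the weights are removed.
    threshold = np.quantile(flat, sparsity)
    mask = np.abs(weights) > threshold
    return weights * mask, mask

# Toy usage: prune about half of a random 4x4 weight matrix.
w = np.random.randn(4, 4)
pruned_w, mask = magnitude_prune(w, sparsity=0.5)
print("kept fraction:", mask.mean())
```

After pruning, the network is typically fine‑tuned with the mask held fixed so that the zeroed connections stay removed.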
2) Low‑Rank Matrix Approximation
Approximating weight matrices with low‑rank factors reduces the number of parameters and the computational cost: an m×n weight matrix can be replaced by the product of an m×r and an r×n factor, so storage and the cost of a matrix–vector product drop from m·n to r·(m+n) when the rank r is small.
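To make the saving concrete, here is a small sketch that factorizes a weight matrix with a truncated SVD; the function name low_rank_factorize is illustrative.

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Approximate W (m x n) by the product of two smaller matrices.

    Truncated SVD keeps only the top `rank` singular values, so the
    m*n weights are replaced by m*rank + rank*n.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # m x rank
    B = Vt[:rank, :]             # rank x n
    return A, B

W = np.random.randn(256, 512)
A, B = low_rank_factorize(W, rank=32)
print("relative approximation error:",
      np.linalg.norm(W - A @ B) / np.linalg.norm(W))
```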
3) Convolutional Kernel/Filter Selection
Instead of pruning individual weights, entire kernels or filters are removed, decreasing both the number of filters and the size of feature maps, which speeds up computation. Methods based on kernel weight statistics or feature‑map information (e.g., ThiNet) select filters to discard while minimizing impact on downstream layers.
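The sketch below shows the simple weight‑statistics variant: ranking filters by their L1 norm (as in Li et al.) and keeping the strongest. ThiNet's feature‑map criterion is more involved and is not reproduced here; the helper name select_filters_by_l1 is hypothetical.

```python
import numpy as np

def select_filters_by_l1(conv_weight, keep_ratio=0.5):
    """Rank convolutional filters by L1 norm and keep the strongest.

    conv_weight has shape (out_channels, in_channels, kH, kW).
    Filters with small L1 norm are assumed to contribute little and
    are candidates for removal.
    """
    out_channels = conv_weight.shape[0]
    scores = np.abs(conv_weight).reshape(out_channels, -1).sum(axis=1)
    n_keep = max(1, int(out_channels * keep_ratio))
    keep_idx = np.argsort(scores)[-n_keep:]   # indices of the strongest filters
    return np.sort(keep_idx)

w = np.random.randn(64, 32, 3, 3)
kept = select_filters_by_l1(w, keep_ratio=0.25)
print("keeping", len(kept), "of", w.shape[0], "filters")
```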
4) Quantization and Encoding
Quantization represents weights at lower precision, for example with scalar, vector, or product quantization codebooks. Incremental Network Quantization (INQ) partitions the parameters into groups, quantizes one group at a time, and retrains the remaining full‑precision weights, achieving low‑bit representations with little or no accuracy loss. Binary networks such as BinaryNet and XNOR‑Net compress models even further, making them well suited to hardware deployment.
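For intuition, the sketch below shows plain symmetric uniform quantization to a few bits. It is not the grouped, iterative INQ procedure, just the basic idea of storing small integer codes plus a single scale; the function names are illustrative.

```python
import numpy as np

def uniform_quantize(weights, num_bits=4):
    """Map weights to a small set of evenly spaced levels.

    Symmetric uniform quantizer: each weight becomes a small integer
    code plus one shared floating-point scale.
    """
    levels = 2 ** num_bits - 1
    scale = np.abs(weights).max() / (levels / 2)
    q = np.round(weights / scale)                     # integer codes
    q = np.clip(q, -(levels // 2), levels // 2)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(1000).astype(np.float32)
q, scale = uniform_quantize(w, num_bits=4)
print("max abs reconstruction error:", np.abs(w - dequantize(q, scale)).max())
```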
5) Knowledge Distillation
A large “teacher” network transfers its knowledge to a smaller “student” network, typically by adding a distillation loss that combines the standard cross‑entropy with a term encouraging the student’s outputs to match the teacher’s.
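A common way to write that combined loss is the temperature‑softened formulation popularized by Hinton et al.; the exact weighting is a design choice, and this PyTorch sketch is illustrative rather than a reference implementation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Combine hard-label cross-entropy with a soft-target matching term."""
    # Soft targets: KL divergence between softened distributions,
    # scaled by T^2 to keep gradient magnitudes comparable.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2
    # Hard targets: ordinary cross-entropy with the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage with random logits for a batch of 8 examples and 10 classes.
student = torch.randn(8, 10)
teacher = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student, teacher, labels))
```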
6) Designing New Network Architectures
Modern architectures embed compression ideas directly: SqueezeNet relies heavily on 1×1 convolutions to squeeze out redundancy, while MobileNet and Xception build on depthwise separable convolutions to lower parameter count and computation.
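To see why depthwise separable convolutions help, the sketch below (PyTorch, illustrative) builds one such block and compares its parameter count to a standard 3×3 convolution.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution as used in MobileNet/Xception.

    A standard 3x3 conv needs in*out*3*3 weights; splitting it into a
    per-channel (depthwise) 3x3 conv followed by a 1x1 (pointwise) conv
    needs only in*3*3 + in*out weights.
    """
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   padding=1, groups=in_channels, bias=False)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                   bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

block = DepthwiseSeparableConv(64, 128)
standard = nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False)
print(sum(p.numel() for p in block.parameters()),      # 64*3*3 + 64*128 = 8768
      sum(p.numel() for p in standard.parameters()))   # 64*128*3*3   = 73728
```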
These methods are increasingly hardware‑aware, aiming to produce compact, fast models rather than merely maximizing accuracy.
References: [1] Denil et al., 2013; [2] Han et al., 2015; [3] Guo et al., 2016; [4] Wen et al., 2017; [5] Denton et al., 2014; [6] Kim et al., 2015; [7] Jaderberg et al., 2014; [8] Szegedy et al., 2016; [9] Zhang et al., 2016; [10] Szegedy et al., 2015; [11] Li et al., 2016; [12] Hu et al., 2016; [13] Luo et al., 2017; [14] Zhou et al., 2017; [15] Courbariaux et al., 2016; [16] Rastegari et al., 2016; [17] Ba & Caruana, 2014; [18] Chollet, 2017; [19] Liu et al., 2018.