How Can Deep Neural Networks Be Accelerated and Compressed? Key Techniques Explained
This article reviews why deep neural networks are over‑parameterized, outlines the challenges of deploying them on mobile and embedded devices, and presents six major strategies—pruning, low‑rank approximation, filter selection, quantization, knowledge distillation, and novel architecture design—to accelerate and compress models while preserving performance.
In recent years deep neural networks have become essential in vision (image classification, video analysis) and language (machine translation, speech recognition) tasks, but popular architectures such as VGG‑16 (≈138 M parameters) and ResNet‑50 (≈25 M parameters) face storage and computation bottlenecks that limit their use on mobile and embedded devices.
These networks contain a large number of redundant parameters, a phenomenon known as over‑parameterization. Compressing networks to reduce model size, inference time, and memory consumption while maintaining task performance has therefore become an active research area.
Question
From which aspects can neural networks be accelerated and compressed?
Analysis and Answer
Network compression techniques can be grouped into six main categories:
1) Network Parameter Pruning
Pruning removes neurons or connections with low importance, reducing weight count without altering the original architecture. Typical pipelines involve training a large model, applying a pruning strategy, and fine‑tuning the pruned network. Early work by Han et al. set a magnitude threshold to zero out small weights, while later dynamic pruning methods adapt importance scores during training.
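Below is a minimal sketch of magnitude‑based pruning in the spirit of Han et al.; the helper name magnitude_prune and the NumPy implementation are illustrative, not taken from the original work.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude weights.

    Illustrative magnitude-based pruning: weights below a threshold are
    set to zero, and the resulting mask would normally be held fixed
    while the surviving weights are fine-tuned.
    """
    flat = np.abs(weights).ravel()
    # Pick the threshold so that roughly `sparsity` of the weights are removed.
    threshold = np.quantile(flat, sparsity)
    mask = np.abs(weights) > threshold
    return weights * mask, mask

# Toy usage: prune about half of a random 4x4 weight matrix.
w = np.random.randn(4, 4)
pruned_w, mask = magnitude_prune(w, sparsity=0.5)
print("kept fraction:", mask.mean())
```

After pruning, the network is typically fine‑tuned with the mask held fixed so that the zeroed connections stay removed.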
2) Low‑Rank Matrix Approximation
Approximating weight matrices with low‑rank factors reduces the number of parameters and the computational cost: an m×n weight matrix can be replaced by the product of an m×r and an r×n factor, so storage and the cost of a matrix–vector product drop from m·n to r·(m+n) when the rank r is small.
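To make the saving concrete, here is a small sketch that factorizes a weight matrix with a truncated SVD; the function name low_rank_factorize is illustrative.

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Approximate W (m x n) by the product of two smaller matrices.

    Truncated SVD keeps only the top `rank` singular values, so the
    m*n weights are replaced by m*rank + rank*n.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # m x rank
    B = Vt[:rank, :]             # rank x n
    return A, B

W = np.random.randn(256, 512)
A, B = low_rank_factorize(W, rank=32)
print("relative approximation error:",
      np.linalg.norm(W - A @ B) / np.linalg.norm(W))
```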
3) Convolutional Kernel/Filter Selection
Instead of pruning individual weights, entire kernels or filters are removed, decreasing both the number of filters and the size of feature maps, which speeds up computation. Methods based on kernel weight statistics or feature‑map information (e.g., ThiNet) select filters to discard while minimizing impact on downstream layers.
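The sketch below shows the simple weight‑statistics variant: ranking filters by their L1 norm (as in Li et al.) and keeping the strongest. ThiNet's feature‑map criterion is more involved and is not reproduced here; the helper name select_filters_by_l1 is hypothetical.

```python
import numpy as np

def select_filters_by_l1(conv_weight, keep_ratio=0.5):
    """Rank convolutional filters by L1 norm and keep the strongest.

    conv_weight has shape (out_channels, in_channels, kH, kW).
    Filters with small L1 norm are assumed to contribute little and
    are candidates for removal.
    """
    out_channels = conv_weight.shape[0]
    scores = np.abs(conv_weight).reshape(out_channels, -1).sum(axis=1)
    n_keep = max(1, int(out_channels * keep_ratio))
    keep_idx = np.argsort(scores)[-n_keep:]   # indices of the strongest filters
    return np.sort(keep_idx)

w = np.random.randn(64, 32, 3, 3)
kept = select_filters_by_l1(w, keep_ratio=0.25)
print("keeping", len(kept), "of", w.shape[0], "filters")
```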
4) Quantization and Encoding
Quantization represents weights at lower precision, for example with scalar, vector, or product quantization codebooks. Incremental Network Quantization (INQ) partitions the parameters into groups, quantizes one group at a time, and retrains the remaining full‑precision weights, achieving low‑bit representations with little or no accuracy loss. Binary networks such as BinaryNet and XNOR‑Net compress models even further, making them well suited to hardware deployment.
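For intuition, the sketch below shows plain symmetric uniform quantization to a few bits. It is not the grouped, iterative INQ procedure, just the basic idea of storing small integer codes plus a single scale; the function names are illustrative.

```python
import numpy as np

def uniform_quantize(weights, num_bits=4):
    """Map weights to a small set of evenly spaced levels.

    Symmetric uniform quantizer: each weight becomes a small integer
    code plus one shared floating-point scale.
    """
    levels = 2 ** num_bits - 1
    scale = np.abs(weights).max() / (levels / 2)
    q = np.round(weights / scale)                     # integer codes
    q = np.clip(q, -(levels // 2), levels // 2)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(1000).astype(np.float32)
q, scale = uniform_quantize(w, num_bits=4)
print("max abs reconstruction error:", np.abs(w - dequantize(q, scale)).max())
```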
5) Knowledge Distillation
A large “teacher” network transfers its knowledge to a smaller “student” network, typically by adding a distillation loss that combines the standard cross‑entropy with a term encouraging the student’s outputs to match the teacher’s.
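A common way to write that combined loss is the temperature‑softened formulation popularized by Hinton et al.; the exact weighting is a design choice, and this PyTorch sketch is illustrative rather than a reference implementation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Combine hard-label cross-entropy with a soft-target matching term."""
    # Soft targets: KL divergence between softened distributions,
    # scaled by T^2 to keep gradient magnitudes comparable.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2
    # Hard targets: ordinary cross-entropy with the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage with random logits for a batch of 8 examples and 10 classes.
student = torch.randn(8, 10)
teacher = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student, teacher, labels))
```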
6) Designing New Network Architectures
Modern architectures embed compression ideas directly: SqueezeNet relies heavily on 1×1 convolutions to squeeze out redundancy, while MobileNet and Xception build on depthwise separable convolutions to lower parameter count and computation.
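To see why depthwise separable convolutions help, the sketch below (PyTorch, illustrative) builds one such block and compares its parameter count to a standard 3×3 convolution.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution as used in MobileNet/Xception.

    A standard 3x3 conv needs in*out*3*3 weights; splitting it into a
    per-channel (depthwise) 3x3 conv followed by a 1x1 (pointwise) conv
    needs only in*3*3 + in*out weights.
    """
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   padding=1, groups=in_channels, bias=False)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                   bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

block = DepthwiseSeparableConv(64, 128)
standard = nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False)
print(sum(p.numel() for p in block.parameters()),      # 64*3*3 + 64*128 = 8768
      sum(p.numel() for p in standard.parameters()))   # 64*128*3*3   = 73728
```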
These methods are increasingly hardware‑aware, aiming to produce compact, fast models rather than merely maximizing accuracy.
References: [1] Denil et al., 2013; [2] Han et al., 2015; [3] Guo et al., 2016; [4] Wen et al., 2017; [5] Denton et al., 2014; [6] Kim et al., 2015; [7] Jaderberg et al., 2014; [8] Szegedy et al., 2016; [9] Zhang et al., 2016; [10] Szegedy et al., 2015; [11] Li et al., 2016; [12] Hu et al., 2016; [13] Luo et al., 2017; [14] Zhou et al., 2017; [15] Courbariaux et al., 2016; [16] Rastegari et al., 2016; [17] Ba & Caruana, 2014; [18] Chollet, 2017; [19] Liu et al., 2018.