Artificial Intelligence · 14 min read

How to Speed Up Deep Learning Models: Cutting-Edge Acceleration Techniques

Deep learning models often suffer from slow training and deployment due to their size, but a range of advanced acceleration methods—including model architecture optimization, pruning, quantization, knowledge distillation, and distributed training techniques—can dramatically improve speed and efficiency while maintaining performance.

Tencent Tech

Why Accelerate Deep Learning Models?

In recent years, deep learning has achieved remarkable results in image, text, speech, and recommendation domains. Tencent teams apply these models to real-world services, but large data volumes and limited compute resources lead to slow training, high latency, and difficult deployment. To be widely adopted, models must therefore be both cheap to run and fast to respond.

Model acceleration aims for "fast, good, cheap" by improving training and inference speed through computation, system, and hardware optimizations.

How to Reduce Computation?

Computation‑optimisation techniques seek a balance between model performance and efficiency, focusing on four main approaches: model structure optimisation, pruning, quantisation, and knowledge distillation.

1. Model Structure Optimisation

This approach replaces heavy computational components with lightweight ones, usually designed from human experience. In CNNs, examples include replacing fully-connected layers with 1×1 convolutions and global average pooling (NIN), stacking small convolutional kernels (VGG, GoogLeNet), and using compact or depthwise-separable convolutions (SqueezeNet, MobileNet, ShuffleNet), all of which reduce parameters while preserving accuracy. Similar ideas apply to sequence models such as QRNN and SRNN, and to Transformer-based architectures like Reformer.
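To make the savings concrete, here is a minimal sketch (with illustrative channel counts, not from the original article) comparing the parameter count of a standard 3×3 convolution against the MobileNet-style depthwise-separable factorisation:

```python
# Parameter-count comparison: standard 3x3 convolution vs. the
# depthwise-separable factorisation used by MobileNet-style models.
def standard_conv_params(c_in, c_out, k=3):
    # every output channel has its own k x k x c_in kernel
    return c_out * c_in * k * k

def depthwise_separable_params(c_in, c_out, k=3):
    # depthwise: one k x k kernel per input channel,
    # pointwise: a 1x1 convolution that mixes channels
    return c_in * k * k + c_in * c_out

c_in, c_out = 256, 256
std = standard_conv_params(c_in, c_out)        # 589,824 weights
sep = depthwise_separable_params(c_in, c_out)  # 67,840 weights
print(f"reduction: {std / sep:.1f}x")
```

At these channel counts the factorisation cuts the layer's parameters by roughly 8.7×, which is why such substitutions shrink models with little accuracy loss.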

Comparison of CNN model optimisation methods

2. Model Pruning

Pruning removes redundant parameters to shrink over-parameterised models. It falls into two categories: structured pruning (e.g., channel-level or filter-level), which yields regular structures that commodity hardware can exploit directly, and unstructured (weight-level) pruning, which produces irregular sparse matrices that often lack hardware support. Methods such as MetaPruning learn a weight-generation network to evaluate candidate pruned structures without fully retraining each one.
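A minimal sketch of structured pruning (magnitude-based filter selection, a common baseline rather than any specific method from the article): rank each convolutional filter by its L1 norm and keep only the strongest ones.

```python
import numpy as np

def prune_filters(weights, keep_ratio=0.5):
    """Filter-level magnitude pruning.

    weights: conv weight tensor of shape (num_filters, c_in, k, k).
    Returns the surviving filters and their original indices.
    """
    # L1 norm of each filter as its importance score
    norms = np.abs(weights).reshape(weights.shape[0], -1).sum(axis=1)
    n_keep = max(1, int(weights.shape[0] * keep_ratio))
    # keep the strongest filters, preserving their original order
    keep = np.sort(np.argsort(norms)[::-1][:n_keep])
    return weights[keep], keep

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 4, 3, 3))
pruned, kept = prune_filters(w, keep_ratio=0.5)
print(pruned.shape)  # (4, 4, 3, 3)
```

Because whole filters are removed, the result is a smaller dense tensor rather than a sparse matrix, which is what makes structured pruning hardware-friendly.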

Illustration of model pruning principle

3. Model Quantisation

Quantisation reduces the bit-width of weights and activations (e.g., FP16, mixed precision, INT8) to accelerate computation. While INT8 offers substantial speed gains, it can degrade accuracy; techniques such as value-range adjustment, calibration, and quantisation-aware training mitigate the loss. Extreme schemes (binary, ternary, and XNOR networks) compress models even further.
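As an illustration (a generic symmetric scheme, not the article's specific pipeline), per-tensor INT8 quantisation maps the observed value range onto [-127, 127] with a single scale factor:

```python
import numpy as np

def quantize_int8(x):
    # symmetric per-tensor quantisation: one scale for the whole tensor
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # recover an approximation of the original floats
    return q.astype(np.float32) * scale

x = np.array([-1.0, -0.5, 0.0, 0.25, 1.0], dtype=np.float32)
q, scale = quantize_int8(x)
x_hat = dequantize(q, scale)
print(np.max(np.abs(x - x_hat)))  # rounding error is bounded by scale/2
```

Calibration in real toolchains amounts to choosing this scale (or a clipped value range) from representative activation statistics rather than the raw min/max.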

Weight distribution of deep neural network

4. Knowledge Distillation

Distillation transfers knowledge from a large teacher model to a smaller student model, improving inference speed while retaining performance. Variants include representation distillation, multi‑step distillation with assistants, multi‑task distillation, and approaches like TinyBERT that apply two‑stage distillation to pre‑training and fine‑tuning phases.
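The core of classic soft-target distillation can be sketched as follows (a Hinton-style KL loss on temperature-softened logits; the temperature and logits here are illustrative):

```python
import numpy as np

def softmax(z, T=1.0):
    # temperature T > 1 softens the distribution, exposing "dark knowledge"
    z = np.asarray(z, dtype=np.float64) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, T=4.0):
    p = softmax(teacher_logits, T)  # softened teacher distribution
    q = softmax(student_logits, T)  # softened student distribution
    # KL(p || q): penalise the student for diverging from the teacher
    return float(np.sum(p * (np.log(p) - np.log(q))))

teacher = [8.0, 2.0, 1.0]
student = [7.5, 2.5, 1.0]
print(distillation_loss(teacher, student))
```

In practice this soft loss is combined with the ordinary hard-label cross-entropy, and variants like TinyBERT add further matching terms on intermediate representations.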

Teacher‑Student model distillation framework

How to Compute Faster?

The most effective way to accelerate training is to scale from single‑machine to multi‑machine setups, using model parallelism or data parallelism. Parameter Server (PS) architectures synchronise parameters across workers, but network communication can become a bottleneck.

Research focuses on two directions: developing new communication mechanisms (e.g., RingAllReduce) and reducing data exchange volume (e.g., gradient compression). RingAllReduce arranges nodes in a ring, allowing each node to exchange partial data with neighbours, completing synchronisation in 2·(N‑1) steps.
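The 2·(N−1)-step schedule can be verified with a toy simulation (a simplified sequential model of the ring; real implementations such as NCCL overlap these transfers):

```python
import numpy as np

def ring_allreduce(grads):
    """Simulate RingAllReduce over N nodes, each holding one gradient vector.

    Each vector is split into N chunks; every ring step moves exactly one
    chunk per node, and after 2*(N-1) steps every node holds the full sum.
    """
    n = len(grads)
    chunks = [np.array_split(g.astype(float), n) for g in grads]
    steps = 0
    # scatter-reduce: after n-1 steps, node i holds the full sum of one chunk
    for s in range(n - 1):
        for i in range(n):
            src, dst = i, (i + 1) % n
            c = (i - s) % n
            chunks[dst][c] = chunks[dst][c] + chunks[src][c]
        steps += 1
    # all-gather: circulate each completed chunk around the ring
    for s in range(n - 1):
        for i in range(n):
            src, dst = i, (i + 1) % n
            c = (i - s + 1) % n
            chunks[dst][c] = chunks[src][c]
        steps += 1
    return [np.concatenate(c) for c in chunks], steps

workers = [np.full(8, float(i)) for i in range(4)]  # 4 simulated workers
summed, steps = ring_allreduce(workers)
print(steps)  # 6 steps for N = 4, i.e. 2 * (N - 1)
```

The key property is that each node's per-step traffic is constant (one chunk) regardless of N, so total bandwidth per node stays bounded as the cluster grows.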

RingAllReduce communication diagram

Gradient compression techniques such as DGC transmit only the most important gradients, achieving up to 600× compression with negligible accuracy loss, though overly aggressive compression can hurt convergence. Local gradient accumulation and compensation mechanisms are common practical remedies.
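A minimal sketch of DGC-style top-k sparsification with local accumulation (the selection rule is illustrative; DGC itself adds momentum correction and other refinements):

```python
import numpy as np

def sparsify(grad, residual, k):
    """Send only the k largest-magnitude gradients; accumulate the rest."""
    acc = grad + residual              # add back previously unsent gradients
    idx = np.argsort(np.abs(acc))[-k:] # indices of the k largest magnitudes
    sent = np.zeros_like(acc)
    sent[idx] = acc[idx]               # transmit only these entries
    new_residual = acc - sent          # keep the remainder for later steps
    return sent, new_residual

g = np.array([0.9, -0.05, 0.02, -1.2, 0.1])
residual = np.zeros_like(g)
sent, residual = sparsify(g, residual, k=2)
print(sent)  # only the two largest-magnitude entries are non-zero
```

Because unsent gradients accumulate in the residual rather than being discarded, every gradient is eventually transmitted, which is what preserves convergence under high compression ratios.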

DGC algorithm illustration

Summary & Outlook

This article examined computation- and system-level acceleration techniques for deep learning models, highlighting industry research and practical advances. Model acceleration is a systematic engineering challenge that requires co-design of algorithms and infrastructure, tailored to specific models and application scenarios.

Tencent will continue to explore gradient and parameter compression, as well as pre‑training optimisation, and will share further insights in future publications.

Tags: deep learning, quantization, pruning, model acceleration, knowledge distillation, distributed training
Written by Tencent Tech

Tencent's official tech account. Delivering quality technical content to serve developers.
