Artificial Intelligence · 14 min read

How to Speed Up Deep Learning Models: Cutting-Edge Acceleration Techniques

Deep learning models often suffer from slow training and deployment due to their size, but a range of advanced acceleration methods—including model architecture optimization, pruning, quantization, knowledge distillation, and distributed training techniques—can dramatically improve speed and efficiency while maintaining performance.

Tencent Tech

Why Accelerate Deep Learning Models?

In recent years, deep learning has achieved remarkable results in image, text, speech, and recommendation domains. Tencent teams apply these models to real-world services, but large data volumes and limited compute resources lead to slow training, high latency, and difficult deployment. To be widely adopted, models must therefore be both cheap to run and fast to respond.

Model acceleration aims for "fast, good, cheap" by improving training and inference speed through computation, system, and hardware optimizations.

How to Reduce Computation?

Computation‑optimisation techniques seek a balance between model performance and efficiency, focusing on four main approaches: model structure optimisation, pruning, quantisation, and knowledge distillation.

1. Model Structure Optimisation

This approach replaces heavy computational components with lightweight ones, usually designed from human experience. In CNNs, examples include replacing fully-connected layers with 1×1 convolutions and global average pooling (NIN), stacking small convolutional kernels (VGG, GoogLeNet), and using compact or depthwise-separable convolutions (SqueezeNet, MobileNet, ShuffleNet), all of which reduce parameters while preserving accuracy. Similar ideas apply to sequence models such as QRNN and SRNN, and to Transformer-based architectures like Reformer.
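To make the savings concrete, here is a minimal sketch (with illustrative channel counts, not from the original article) comparing the parameter count of a standard 3×3 convolution against the MobileNet-style depthwise-separable factorisation:

```python
# Parameter-count comparison: standard 3x3 convolution vs. the
# depthwise-separable factorisation used by MobileNet-style models.
def standard_conv_params(c_in, c_out, k=3):
    # every output channel has its own k x k x c_in kernel
    return c_out * c_in * k * k

def depthwise_separable_params(c_in, c_out, k=3):
    # depthwise: one k x k kernel per input channel,
    # pointwise: a 1x1 convolution that mixes channels
    return c_in * k * k + c_in * c_out

c_in, c_out = 256, 256
std = standard_conv_params(c_in, c_out)        # 589,824 weights
sep = depthwise_separable_params(c_in, c_out)  # 67,840 weights
print(f"reduction: {std / sep:.1f}x")
```

At these channel counts the factorisation cuts the layer's parameters by roughly 8.7×, which is why such substitutions shrink models with little accuracy loss.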

Comparison of CNN model optimisation methods

2. Model Pruning

Pruning removes redundant parameters to shrink over-parameterised models. It falls into two categories: structured pruning (e.g., channel-level or filter-level), which yields regular structures that commodity hardware can exploit directly, and unstructured (weight-level) pruning, which produces irregular sparse matrices that often lack hardware support. Methods such as MetaPruning learn a weight-generation network to evaluate candidate pruned structures without fully retraining each one.
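A minimal sketch of structured pruning (magnitude-based filter selection, a common baseline rather than any specific method from the article): rank each convolutional filter by its L1 norm and keep only the strongest ones.

```python
import numpy as np

def prune_filters(weights, keep_ratio=0.5):
    """Filter-level magnitude pruning.

    weights: conv weight tensor of shape (num_filters, c_in, k, k).
    Returns the surviving filters and their original indices.
    """
    # L1 norm of each filter as its importance score
    norms = np.abs(weights).reshape(weights.shape[0], -1).sum(axis=1)
    n_keep = max(1, int(weights.shape[0] * keep_ratio))
    # keep the strongest filters, preserving their original order
    keep = np.sort(np.argsort(norms)[::-1][:n_keep])
    return weights[keep], keep

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 4, 3, 3))
pruned, kept = prune_filters(w, keep_ratio=0.5)
print(pruned.shape)  # (4, 4, 3, 3)
```

Because whole filters are removed, the result is a smaller dense tensor rather than a sparse matrix, which is what makes structured pruning hardware-friendly.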

Illustration of model pruning principle

3. Model Quantisation

Quantisation reduces the bit-width of weights and activations (e.g., FP16, mixed precision, INT8) to accelerate computation. While INT8 offers substantial speed gains, it can degrade accuracy; techniques such as value-range adjustment, calibration, and quantisation-aware training mitigate the loss. Extreme schemes (binary, ternary, and XNOR networks) compress models even further.
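As an illustration (a generic symmetric scheme, not the article's specific pipeline), per-tensor INT8 quantisation maps the observed value range onto [-127, 127] with a single scale factor:

```python
import numpy as np

def quantize_int8(x):
    # symmetric per-tensor quantisation: one scale for the whole tensor
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # recover an approximation of the original floats
    return q.astype(np.float32) * scale

x = np.array([-1.0, -0.5, 0.0, 0.25, 1.0], dtype=np.float32)
q, scale = quantize_int8(x)
x_hat = dequantize(q, scale)
print(np.max(np.abs(x - x_hat)))  # rounding error is bounded by scale/2
```

Calibration in real toolchains amounts to choosing this scale (or a clipped value range) from representative activation statistics rather than the raw min/max.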

Weight distribution of deep neural network

4. Knowledge Distillation

Distillation transfers knowledge from a large teacher model to a smaller student model, improving inference speed while retaining performance. Variants include representation distillation, multi‑step distillation with assistants, multi‑task distillation, and approaches like TinyBERT that apply two‑stage distillation to pre‑training and fine‑tuning phases.
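The core of classic soft-target distillation can be sketched as follows (a Hinton-style KL loss on temperature-softened logits; the temperature and logits here are illustrative):

```python
import numpy as np

def softmax(z, T=1.0):
    # temperature T > 1 softens the distribution, exposing "dark knowledge"
    z = np.asarray(z, dtype=np.float64) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, T=4.0):
    p = softmax(teacher_logits, T)  # softened teacher distribution
    q = softmax(student_logits, T)  # softened student distribution
    # KL(p || q): penalise the student for diverging from the teacher
    return float(np.sum(p * (np.log(p) - np.log(q))))

teacher = [8.0, 2.0, 1.0]
student = [7.5, 2.5, 1.0]
print(distillation_loss(teacher, student))
```

In practice this soft loss is combined with the ordinary hard-label cross-entropy, and variants like TinyBERT add further matching terms on intermediate representations.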

Teacher‑Student model distillation framework

How to Compute Faster?

The most effective way to accelerate training is to scale from single‑machine to multi‑machine setups, using model parallelism or data parallelism. Parameter Server (PS) architectures synchronise parameters across workers, but network communication can become a bottleneck.

Research focuses on two directions: developing new communication mechanisms (e.g., RingAllReduce) and reducing data exchange volume (e.g., gradient compression). RingAllReduce arranges nodes in a ring, allowing each node to exchange partial data with neighbours, completing synchronisation in 2·(N‑1) steps.
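The 2·(N−1)-step schedule can be verified with a toy simulation (a simplified sequential model of the ring; real implementations such as NCCL overlap these transfers):

```python
import numpy as np

def ring_allreduce(grads):
    """Simulate RingAllReduce over N nodes, each holding one gradient vector.

    Each vector is split into N chunks; every ring step moves exactly one
    chunk per node, and after 2*(N-1) steps every node holds the full sum.
    """
    n = len(grads)
    chunks = [np.array_split(g.astype(float), n) for g in grads]
    steps = 0
    # scatter-reduce: after n-1 steps, node i holds the full sum of one chunk
    for s in range(n - 1):
        for i in range(n):
            src, dst = i, (i + 1) % n
            c = (i - s) % n
            chunks[dst][c] = chunks[dst][c] + chunks[src][c]
        steps += 1
    # all-gather: circulate each completed chunk around the ring
    for s in range(n - 1):
        for i in range(n):
            src, dst = i, (i + 1) % n
            c = (i - s + 1) % n
            chunks[dst][c] = chunks[src][c]
        steps += 1
    return [np.concatenate(c) for c in chunks], steps

workers = [np.full(8, float(i)) for i in range(4)]  # 4 simulated workers
summed, steps = ring_allreduce(workers)
print(steps)  # 6 steps for N = 4, i.e. 2 * (N - 1)
```

The key property is that each node's per-step traffic is constant (one chunk) regardless of N, so total bandwidth per node stays bounded as the cluster grows.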

RingAllReduce communication diagram

Gradient compression techniques such as DGC transmit only the most important gradients, achieving up to 600× compression with negligible accuracy loss, though overly aggressive compression can hurt convergence. Local gradient accumulation and compensation mechanisms are common practical remedies.
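A minimal sketch of DGC-style top-k sparsification with local accumulation (the selection rule is illustrative; DGC itself adds momentum correction and other refinements):

```python
import numpy as np

def sparsify(grad, residual, k):
    """Send only the k largest-magnitude gradients; accumulate the rest."""
    acc = grad + residual              # add back previously unsent gradients
    idx = np.argsort(np.abs(acc))[-k:] # indices of the k largest magnitudes
    sent = np.zeros_like(acc)
    sent[idx] = acc[idx]               # transmit only these entries
    new_residual = acc - sent          # keep the remainder for later steps
    return sent, new_residual

g = np.array([0.9, -0.05, 0.02, -1.2, 0.1])
residual = np.zeros_like(g)
sent, residual = sparsify(g, residual, k=2)
print(sent)  # only the two largest-magnitude entries are non-zero
```

Because unsent gradients accumulate in the residual rather than being discarded, every gradient is eventually transmitted, which is what preserves convergence under high compression ratios.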

DGC algorithm illustration

Summary & Outlook

This article examined computation- and system-level acceleration techniques for deep learning models, highlighting industry research and practical advances. Model acceleration is a systematic engineering challenge that requires co-design of algorithms and infrastructure, tailored to specific models and application scenarios.

Tencent will continue to explore gradient and parameter compression, as well as pre‑training optimisation, and will share further insights in future publications.

Tags: deep learning, quantization, pruning, model acceleration, knowledge distillation, distributed training
Written by Tencent Tech

Tencent's official tech account. Delivering quality technical content to serve developers.
