How to Speed Up Deep Learning Models: Cutting-Edge Acceleration Techniques
Deep learning models often suffer from slow training and deployment due to their size, but a range of advanced acceleration methods—including model architecture optimization, pruning, quantization, knowledge distillation, and distributed training techniques—can dramatically improve speed and efficiency while maintaining performance.
Why Accelerate Deep Learning Models?
In recent years, deep learning has achieved remarkable results in image, text, speech, and recommendation domains. Tencent teams apply these models to real‑world services, but large data volumes and limited compute resources lead to slow training, high latency, and difficult deployment. To be widely adopted, models must therefore be both affordable and fast.
Model acceleration aims for "fast, good, cheap" by improving training and inference speed through computation, system, and hardware optimizations.
How to Reduce Computation?
Computation‑optimisation techniques seek a balance between model performance and efficiency, focusing on four main approaches: model structure optimisation, pruning, quantisation, and knowledge distillation.
1. Model Structure Optimisation
This approach designs lightweight computational components to replace heavy ones, often guided by human experience. In CNNs, replacing fully‑connected layers with convolutional structures (e.g., NIN, VGG, GoogLeNet, SqueezeNet, MobileNet, ShuffleNet) reduces parameters while preserving accuracy. Similar ideas apply to sequence models such as QRNN and SRNN, and to Transformer‑based architectures like Reformer.
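To make the idea concrete, here is a rough parameter‑count comparison for a MobileNet‑style depthwise‑separable convolution versus a standard convolution. The layer sizes are illustrative assumptions, not figures from the article:

```python
# Parameter-count comparison: standard convolution vs. a MobileNet-style
# depthwise-separable convolution (illustrative sizes, not a benchmark).

def standard_conv_params(k, c_in, c_out):
    # One k x k kernel per (input channel, output channel) pair.
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # Depthwise stage: one k x k kernel per input channel.
    # Pointwise stage: a 1 x 1 convolution that mixes channels.
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 128, 256
std = standard_conv_params(k, c_in, c_out)        # 294,912 parameters
sep = depthwise_separable_params(k, c_in, c_out)  # 33,920 parameters
print(f"standard: {std}, separable: {sep}, ratio: {std / sep:.1f}x")
```

For a 3×3 layer with these channel counts, the separable variant uses roughly 8.7× fewer parameters, which is where the speed and memory savings come from.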
2. Model Pruning
Pruning removes redundant parameters to shrink over‑parameterised models. Two categories exist: structured pruning (e.g., channel‑level, filter‑level) that yields regular sparse matrices, and unstructured pruning that creates sparse matrices but often lacks hardware support. Methods like Metapruning learn weight‑generation models to evaluate candidate structures without full retraining.
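A minimal sketch of unstructured magnitude pruning with NumPy (the threshold rule and sparsity level are generic assumptions, not a specific method from the article; as noted above, the resulting sparse matrix needs hardware or kernel support to yield real speedups):

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Unstructured pruning sketch: zero out the smallest-magnitude weights.

    `sparsity` is the fraction of weights to remove.
    """
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    # k-th smallest magnitude becomes the pruning threshold.
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))
pruned = magnitude_prune(w, 0.5)
print(np.count_nonzero(pruned))  # roughly half the weights survive
```

Structured pruning differs only in granularity: instead of individual weights, whole rows, channels, or filters are scored and removed, which keeps the remaining matrix dense.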
3. Model Quantisation
Quantisation reduces the bit‑width of weights and activations (e.g., FP16, mixed precision, INT8) to accelerate computation. While INT8 offers speed gains, it may degrade accuracy; techniques such as value‑range adjustment, calibration, and quantisation‑aware training mitigate this loss. Specialised schemes (binary, ternary, and XNOR networks) compress models even further.
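The core of post‑training INT8 quantisation can be sketched in a few lines. This is a simplified symmetric scheme assuming one scale per tensor; production toolchains refine the value range with calibration data or quantisation‑aware training, as mentioned above:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric quantisation sketch: map FP32 values to INT8.

    The scale maps the largest absolute value to 127; real pipelines
    choose the range via calibration rather than a raw max.
    """
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.array([0.5, -1.2, 3.4, -0.01], dtype=np.float32)
q, scale = quantize_int8(x)
x_hat = dequantize(q, scale)
print(np.max(np.abs(x - x_hat)))  # rounding error is bounded by ~scale/2
```

The accuracy risk is visible here: one outlier (3.4) stretches the scale, so small values like −0.01 are represented coarsely. Calibration exists precisely to pick a tighter range.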
4. Knowledge Distillation
Distillation transfers knowledge from a large teacher model to a smaller student model, improving inference speed while retaining performance. Variants include representation distillation, multi‑step distillation with assistants, multi‑task distillation, and approaches like TinyBERT that apply two‑stage distillation to pre‑training and fine‑tuning phases.
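The classic soft‑label objective from Hinton‑style distillation can be written as a KL divergence between temperature‑softened teacher and student distributions. This sketch shows only that term; in practice it is combined with a hard‑label cross‑entropy loss, and the logits below are made‑up examples:

```python
import numpy as np

def softmax(z, t=1.0):
    z = z / t
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """Soft-label distillation term: KL(teacher || student) at temperature T.

    A higher temperature exposes the teacher's 'dark knowledge' -- the
    relative probabilities of the wrong classes.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return float(np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean())

teacher = np.array([[5.0, 1.0, 0.5]])  # hypothetical logits
student = np.array([[4.0, 1.5, 0.2]])
print(distillation_loss(student, teacher))  # small but nonzero
```

When student and teacher logits coincide, the loss is zero; training pushes the student toward the teacher's full output distribution rather than just its argmax.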
How to Compute Faster?
The most effective way to accelerate training is to scale from single‑machine to multi‑machine setups, using model parallelism or data parallelism. Parameter Server (PS) architectures synchronise parameters across workers, but network communication can become a bottleneck.
Research focuses on two directions: developing new communication mechanisms (e.g., RingAllReduce) and reducing data exchange volume (e.g., gradient compression). RingAllReduce arranges nodes in a ring, allowing each node to exchange partial data with neighbours, completing synchronisation in 2·(N‑1) steps.
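The ring topology can be simulated to see why 2·(N−1) steps suffice: N−1 scatter‑reduce steps leave each node with one fully summed chunk, and N−1 all‑gather steps circulate those chunks to everyone. This is a single‑process simulation sketch, not real networking code:

```python
import numpy as np

def ring_allreduce(grads):
    """Simulated RingAllReduce: N nodes, gradient split into N chunks.

    After N-1 scatter-reduce steps plus N-1 all-gather steps, every
    node holds the elementwise sum of all gradients.
    """
    n = len(grads)
    chunks = [np.array_split(g.astype(float), n) for g in grads]
    # Scatter-reduce: in step s, node i sends chunk (i - s) to node i+1,
    # which adds it to its own copy. Snapshot sends first to model
    # simultaneous exchange.
    for s in range(n - 1):
        sends = [(i, (i - s) % n, chunks[i][(i - s) % n].copy()) for i in range(n)]
        for i, c, data in sends:
            chunks[(i + 1) % n][c] += data
    # All-gather: circulate the fully reduced chunks, overwriting stale ones.
    for s in range(n - 1):
        sends = [(i, (i + 1 - s) % n, chunks[i][(i + 1 - s) % n].copy()) for i in range(n)]
        for i, c, data in sends:
            chunks[(i + 1) % n][c] = data
    return [np.concatenate(ch) for ch in chunks]

grads = [np.arange(6, dtype=float) + 10 * i for i in range(4)]  # 4 workers
out = ring_allreduce(grads)
print(out[0])  # elementwise sum across workers: [60, 64, 68, 72, 76, 80]
```

Each step moves only 1/N of the data per node, so bandwidth per node stays constant as N grows, which is the property that removes the Parameter Server's central bottleneck.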
Gradient compression techniques such as DGC select only important gradients for transmission, achieving up to 600× compression with negligible accuracy loss, though they may affect convergence if too aggressive. Gradient accumulation and compensation remain common practical solutions.
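A DGC‑style top‑k sparsifier can be sketched as follows. The selection ratio and error‑feedback scheme here are generic assumptions; DGC itself adds refinements such as momentum correction and warm‑up:

```python
import numpy as np

def topk_sparsify(grad, ratio=0.01):
    """Top-k gradient compression sketch: transmit only the largest-magnitude
    fraction of gradient values; the rest is kept as a local residual and
    added to the next step's gradient (error feedback)."""
    k = max(1, int(grad.size * ratio))
    flat = np.abs(grad).ravel()
    idx = np.argpartition(flat, -k)[-k:]   # indices of the k largest magnitudes
    values = grad.ravel()[idx]             # only (idx, values) go on the wire
    residual = grad.copy().ravel()
    residual[idx] = 0.0                    # accumulated locally for next step
    return idx, values, residual.reshape(grad.shape)

g = np.array([0.01, -2.0, 0.3, 0.005, 1.5, -0.02])
idx, vals, residual = topk_sparsify(g, ratio=0.34)
print(sorted(int(i) for i in idx))  # the two dominant gradients are kept
```

Transmitting indices plus values for ~1% of entries is what makes compression ratios in the hundreds possible; the residual accumulation is the "gradient accumulation and compensation" mentioned above, which keeps convergence from drifting.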
Summary & Outlook
This survey examined computation‑ and system‑level acceleration techniques for deep learning models, highlighting industry research and practical advances. Model acceleration is a systematic engineering challenge that requires co‑design of algorithms and infrastructure, tailored to specific models and application scenarios.
Tencent will continue to explore gradient and parameter compression, as well as pre‑training optimisation, and will share further insights in future publications.
Tencent Tech
Tencent's official tech account. Delivering quality technical content to serve developers.