Rocket Launching: Boosting Real-Time CTR Prediction Without Extra Latency
Online click‑through‑rate (CTR) prediction demands millisecond‑level response times, yet deep models with many layers are often too slow to serve within that budget; this paper introduces a “Rocket Launching” framework that jointly trains a lightweight net and a powerful booster net, sharing parameters and using gradient blocking and a hint loss to improve the light net’s accuracy without increasing inference latency.
Abstract
Real‑time response systems, such as online advertising CTR estimation, require extremely low latency (a few milliseconds) while scoring hundreds of candidate ads per request. Deep models with many layers cannot meet these strict latency constraints. To obtain models that satisfy the latency limit and still achieve superior performance, we propose a novel framework: during training we simultaneously train two networks of markedly different complexity, a lightweight network (light net) and a more powerful booster network (booster net). The two networks share part of their parameters and learn from the same labels. The light net additionally learns from the booster net’s soft targets, which improves its training. At test time only the light net is used for prediction.
We call this approach the “Rocket Launching” system. Experiments on public datasets and Alibaba’s online advertising system show that our method improves prediction performance without increasing online response time, demonstrating great value for real‑time models.
Existing Methods
Two main strategies address model latency: (1) making the deployed model itself lighter, either by compressing a trained network or by designing compact architectures such as MobileNet and ShuffleNet, and (2) using a complex model to assist the training of a compact model, as in knowledge distillation and MIMIC. The second strategy can be combined with the first, and because training time is far less constrained than inference time, we build our method on the second approach.
Motivation and Innovation
The training phase resembles a rocket launch: the booster (rocket booster) and the light net (payload) advance together; later the booster detaches, leaving the payload to continue alone. In our framework, the booster network guides the light net during training via parameter sharing and soft‑target supervision, then is removed at inference, yielding better predictions without extra cost.
Training Innovation
1. Co‑training of light and booster nets reduces total training time compared with the traditional teacher‑student pipeline, which trains the teacher first and then the student. The booster continuously provides soft‑target information, giving the light net richer guidance.
2. Gradient‑blocking technique: when back‑propagating the hint loss (MSE between the two nets’ logits), we block its gradients from the booster‑specific parameters, so only the light net (and the shared layers, via the light net’s path) is updated by it; the booster itself keeps learning purely from the ground‑truth labels.
Structural Innovation
The booster and light nets share lower‑level layers (e.g., early convolutional layers or embedding layers), enabling the light net to inherit useful feature representations.
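To make the sharing scheme concrete, here is a minimal PyTorch‑style sketch; the class name `RocketNet`, the layer sizes, and the use of fully connected layers are illustrative assumptions, not the paper’s exact architecture:

```python
import torch
import torch.nn as nn

class RocketNet(nn.Module):
    """Light net and booster net sharing the lower layers (illustrative)."""
    def __init__(self, in_dim=128, hidden=256, n_classes=2):
        super().__init__()
        # Shared lower layers (in practice: embedding layers or early conv blocks)
        self.shared = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        # Light head: shallow, the only part used at inference time
        self.light_head = nn.Linear(hidden, n_classes)
        # Booster head: deeper and more powerful, used only during training
        self.booster_head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, x):
        h = self.shared(x)                                  # shared representation
        return self.light_head(h), self.booster_head(h)    # pre-softmax logits
```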
Method Framework
During training, Light Net and Booster Net are learned jointly, sharing part of their parameters. The loss consists of three terms: (1) light net cross‑entropy with ground truth, (2) booster net cross‑entropy with ground truth, and (3) hint loss (MSE between the two networks’ pre‑softmax logits) to align their predictions.
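The combination of the three terms might look as follows in PyTorch; the weight `lam` on the hint loss and the function name are assumptions for illustration:

```python
import torch.nn.functional as F

def rocket_loss(light_logits, booster_logits, labels, lam=1.0):
    ce_light = F.cross_entropy(light_logits, labels)      # (1) light net vs. ground truth
    ce_booster = F.cross_entropy(booster_logits, labels)  # (2) booster net vs. ground truth
    # (3) hint loss: MSE between pre-softmax logits; detaching the booster's
    # logits is the gradient block discussed below
    hint = F.mse_loss(light_logits, booster_logits.detach())
    return ce_light + ce_booster + lam * hint
```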
Co‑Training
The booster net supervises the light net throughout training, providing richer guidance than a fixed teacher‑student setup because the booster continues to evolve with each iteration.
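A sketch of the resulting single‑stage training loop, reusing the illustrative `RocketNet` and `rocket_loss` defined above (the data loader, optimizer, and hyperparameters are placeholders):

```python
model = RocketNet(in_dim=128, hidden=256, n_classes=2)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for x, y in loader:                 # `loader` yields (features, integer labels)
    light_logits, booster_logits = model(x)
    loss = rocket_loss(light_logits, booster_logits, y)
    opt.zero_grad()
    loss.backward()                 # one backward pass updates both nets together
    opt.step()
```

Because the booster is optimized in the same loop, the soft targets the light net imitates improve with every iteration, unlike a frozen, pre‑trained teacher.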
Hint Loss
We adopt the same L2 loss on logits as in SNN‑MIMIC, aligning the two networks before softmax.
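In code this is a one‑liner; for contrast, the snippet also shows a KD‑style alternative that matches temperature‑softened probabilities rather than logits (the temperature `T` is a standard KD hyperparameter, shown here only for comparison):

```python
# Adopted hint loss: L2 distance between pre-softmax logits (as in SNN-MIMIC)
hint = F.mse_loss(light_logits, booster_logits.detach())

# KD-style alternative: match softened probability distributions instead
T = 2.0
kd = F.kl_div(F.log_softmax(light_logits / T, dim=1),
              F.softmax(booster_logits.detach() / T, dim=1),
              reduction="batchmean") * (T * T)
```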
Gradient Block
When back‑propagating the hint loss, we block its gradients from the booster‑only parameters, preventing the imitation objective from pulling the booster toward the light net; the booster continues to learn solely from the ground‑truth labels.
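In a PyTorch implementation the block can be realized by detaching the booster’s logits before computing the hint loss; a small check (reusing the illustrative `model` above, with `x` a placeholder batch) shows the intended effect:

```python
model.zero_grad(set_to_none=True)
light_logits, booster_logits = model(x)

# detach() stops the hint-loss gradient from flowing back through the booster
hint = F.mse_loss(light_logits, booster_logits.detach())
hint.backward()

# Booster-only parameters receive no gradient from the hint loss ...
assert all(p.grad is None for p in model.booster_head.parameters())
# ... while the light head (and the shared layers, reached through the light
# net's forward path) are still updated by it.
assert all(p.grad is not None for p in model.light_head.parameters())
```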
Experimental Results
We evaluate the necessity of each component and compare against Knowledge Distillation (KD) and Attention Transfer (AT) using Wide Residual Networks (WRN). In the architecture figure (not reproduced here), red + yellow blocks form the light net and blue + red blocks form the booster net; the sharing variants share either only the lowest block or the lowest block of each group, in the spirit of AT’s attention transfer.
Effect of Each Innovation
Parameter sharing and gradient blocking each contribute to performance gains.
Loss Comparison
Comparing candidate formulations of the hint loss motivates the choice adopted above: the L2 loss on pre‑softmax logits.
Light Net Depth Variation
Fixing the booster net and varying the light net’s depth, our method consistently outperforms KD at every depth, showing that the light net benefits from the booster’s guidance.
Visualization
Visualizations reveal that the light net learns low‑level group features from the booster net.
Public Dataset Comparison
On CIFAR‑10, CIFAR‑100, and SVHN, our method consistently surpasses existing teacher‑student approaches; combining with KD yields further improvements.
WRN‑16‑1 (0.2 M parameters) denotes a wide residual network with depth 16 and widening factor 1.
Real‑World Application
On Alibaba’s display advertising dataset, our approach improves GAUC by 0.3 % over using only the light net, while keeping inference cost unchanged.
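GAUC here refers to group AUC; it is commonly computed as the impression‑weighted average of per‑user AUC, and the sketch below follows that convention (the paper’s exact weighting may differ):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def gauc(user_ids, labels, scores):
    """Impression-weighted average of per-user AUC over users with both classes."""
    total, weight = 0.0, 0.0
    for uid in np.unique(user_ids):
        mask = user_ids == uid
        y, s = labels[mask], scores[mask]
        if y.min() == y.max():        # AUC is undefined for single-class users
            continue
        total += mask.sum() * roc_auc_score(y, s)
        weight += mask.sum()
    return total / weight if weight > 0 else float("nan")
```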
Conclusion
Response time is critical for online systems. The Rocket Launching training framework improves model prediction quality without increasing inference latency, offering a reliable solution for high‑traffic scenarios such as the Double‑Eleven shopping festival, when computational resources are stretched thin.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.