How Rocket Launching Boosts Online CTR Prediction Without Slowing Inference
Rocket Launching introduces a novel co‑training framework that jointly trains a lightweight network and a more powerful booster network, sharing parameters and using gradient‑blocking and hint loss to improve click‑through‑rate prediction accuracy while keeping online inference latency unchanged, validated on public datasets and Alibaba’s ad system.
Abstract
Online response time directly determines the effectiveness and user experience of real‑time systems such as click‑through‑rate (CTR) prediction for display ads, where predictions must be made within a few milliseconds for hundreds of candidate ads. To improve model performance under strict latency constraints, we propose a new framework called Rocket Launching. During training, two networks of markedly different complexity are learned simultaneously: a lightweight light net and a more powerful booster net. The two networks share part of their parameters and learn from the same labels. The light net additionally learns from the booster net's soft targets, which supervise it more richly than the hard labels alone. At test time, only the light net is used for prediction.
Existing Methods
Two main approaches address model latency: (1) reducing inference cost directly, through model compression or lightweight architectures such as MobileNet and ShuffleNet; (2) using a complex model to assist the training of a compact model, then deploying only the compact model at inference (e.g., knowledge distillation, MIMIC). These approaches are not mutually exclusive. Because training time is far less constrained than inference time, we adopt the second approach and design our method accordingly.
Motivation and Innovation
The training process is analogous to a rocket launch: the booster and the payload travel together at first; the booster then detaches, and the payload continues alone. Similarly, the booster net guides the light net during training and is removed at inference time, so prediction quality improves at no extra serving cost.
Training Innovations
1. Co‑training of two models: jointly training the light and booster nets shortens total training time compared with the traditional two‑stage teacher‑student pipeline, and the booster provides continuous soft‑target guidance to the light net throughout training.
2. Gradient block: during back‑propagation of the hint loss, gradients are blocked for parameters exclusive to the booster net, allowing the booster to learn freely from the ground‑truth labels while still providing stable supervision to the light net (see the sketch in the Gradient Block section below).
Structural Innovation
The booster and light nets share lower‑level layers (e.g., early convolutional layers or embeddings), enabling the light net to inherit rich feature representations.
Method Framework
Figure 1 shows the overall architecture. During training, the light net and the booster net are learned simultaneously and share part of their parameters. The shared lower layers handle representation learning, while each net's upper layers perform task‑specific discrimination.
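A minimal PyTorch‑style sketch of this shared‑bottom layout (all layer sizes, depths, and names here are illustrative assumptions, not taken from the paper):

```python
import torch
import torch.nn as nn

class RocketNet(nn.Module):
    """Shared-bottom layout: both nets read the same shared representation."""
    def __init__(self, in_dim=128, hidden=64, n_classes=2):
        super().__init__()
        # Lower layers shared by both nets (representation learning).
        self.shared = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        # Light net: a single small task-specific layer.
        self.light_head = nn.Linear(hidden, n_classes)
        # Booster net: a deeper task-specific stack.
        self.booster_head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, x):
        h = self.shared(x)
        # Return (light logits, booster logits).
        return self.light_head(h), self.booster_head(h)
```

The light head is deliberately shallow; at serving time only `shared` and `light_head` are executed.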
Loss Function
The total loss consists of three terms:
Light net loss on ground‑truth labels.
Booster net loss on ground‑truth labels.
Hint loss: mean‑square error between the logits (pre‑softmax outputs) of the two nets, encouraging them to produce similar predictions.
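Written out (a sketch in the paper's notation, where \(p(x)\) and \(q(x)\) are the light and booster nets' predictions, \(l(x)\) and \(z(x)\) their logits, \(H\) the cross‑entropy, and \(\lambda\) a hyperparameter weighting the hint term):

\[
\mathcal{L} = H\big(y,\, p(x)\big) + H\big(y,\, q(x)\big) + \lambda \,\big\lVert l(x) - z(x) \big\rVert_2^2
\]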
Co‑Training
The booster net continuously supervises the light net throughout training, providing richer guidance than a fixed teacher‑student setup where the teacher’s output is static.
Hint Loss
Our hint loss applies an L2 (mean‑square) loss directly to the logits, similar to the approach used in SNN‑MIMIC. This differs from Hinton's knowledge distillation (KD), which applies KL divergence to softmax outputs softened by a temperature parameter.
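For concreteness, a hedged PyTorch sketch of the two variants (the temperature T = 3.0 is an arbitrary illustrative value):

```python
import torch.nn.functional as F

def hint_loss_mse(light_logits, booster_logits):
    # Rocket Launching / SNN-MIMIC style: L2 directly on pre-softmax logits.
    return F.mse_loss(light_logits, booster_logits)

def kd_loss(light_logits, booster_logits, T=3.0):
    # Hinton-style KD, for contrast: KL divergence between
    # temperature-softened softmax distributions.
    p_teacher = F.softmax(booster_logits / T, dim=-1)
    log_p_student = F.log_softmax(light_logits / T, dim=-1)
    # T*T restores the gradient magnitude after temperature scaling.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T
```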
Gradient Block
To give the booster net maximal freedom, we block gradients for its exclusive parameters during back‑propagation of the hint loss, preventing the light net from influencing the booster’s learning.
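In PyTorch, one common way to realize this is to detach the booster's logits inside the hint term, so hint‑loss gradients flow only through the light net's path; a sketch reusing the hypothetical RocketNet module above (lambda_hint = 0.5 is an illustrative weight, not from the paper):

```python
import torch.nn.functional as F

def train_step(model, optimizer, x, y, lambda_hint=0.5):
    light_logits, booster_logits = model(x)
    loss_light = F.cross_entropy(light_logits, y)      # light net vs. ground truth
    loss_booster = F.cross_entropy(booster_logits, y)  # booster net vs. ground truth
    # Gradient block: detach() stops the hint loss from back-propagating
    # into the booster's parameters, so the booster learns only from the
    # labels while still supervising the light net.
    loss_hint = F.mse_loss(light_logits, booster_logits.detach())
    loss = loss_light + loss_booster + lambda_hint * loss_hint
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```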
Experimental Results
We evaluate each component of the method and compare against teacher‑student baselines such as Knowledge Distillation (KD) and Attention Transfer (AT), using Wide Residual Networks (WRN) as the backbone. In the paper's architecture figure, red layers are shared by both nets, yellow layers belong only to the light net, and blue layers only to the booster net; different sharing schemes (a single shared bottom block vs. one shared block per group) are explored.
Effect of Innovations
Ablations show that parameter sharing and gradient blocking each contribute to the performance gains.
Loss Comparisons
Comparing alternative formulations of the hint loss supports the choice described above: applying mean‑square error directly to the logits outperforms the probability‑based variants.
Light Net Depth Variation
With the booster net fixed, light nets of varying depth trained with our framework consistently outperform their KD‑trained counterparts, demonstrating that the light net benefits from the booster's continuous guidance.
Visualization
Visualization experiments show that the light net learns low‑level group features from the booster net.
Public Dataset Comparisons
On CIFAR‑10, CIFAR‑100, and SVHN, our method consistently outperforms existing teacher‑student approaches, and further improves when combined with KD.
WRN‑16‑1 (0.2M parameters) denotes a wide residual network with depth 16 and widening factor 1.
Real‑World Application
On Alibaba's display‑ad dataset, our method improves GAUC by 0.3% over training the light net alone, while keeping inference latency unchanged.
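GAUC here denotes the impression‑weighted average of per‑user AUC, a standard ranking metric in ad systems (this definition is not spelled out in the article itself; \(w_u\) is the impression count of user \(u\)):

\[
\mathrm{GAUC} = \frac{\sum_u w_u \cdot \mathrm{AUC}_u}{\sum_u w_u}
\]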
Conclusion
Online response time is critical for real‑time systems. The Rocket Launching framework improves prediction accuracy without increasing inference time, offering a reliable solution for high‑traffic scenarios such as Double‑Eleven. It achieves up to an eight‑fold reduction in online computation while maintaining accuracy, cutting resource consumption and removing the need to fall back to degraded models during peak traffic.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.