Dual DNN Ranking Model with Online Knowledge Distillation for Recommender Systems
iQIYI’s dual‑DNN ranking model uses an online teacher‑student knowledge‑distillation framework where a complex teacher DNN shares representations with a lightweight student DNN, enabling end‑to‑end training, large‑scale feature crossing, and substantially higher recommendation accuracy while cutting inference latency and model size.
In recent years, with the rapid development of artificial intelligence, deep learning has been widely applied in industrial scenarios. Compared with traditional machine‑learning models, deep learning can automatically construct features inside the model, enabling end‑to‑end learning and higher performance; at the same time, it introduces a new tension between model effectiveness and inference efficiency.
iQIYI introduced an online knowledge‑distillation method to balance model performance and latency, obtaining notable improvements in both short‑video and image‑text recommendation streams. This article mainly presents iQIYI’s dual‑DNN ranking model that was developed to address this conflict.
The evolution of deep‑learning‑based ranking models can be divided into three stages. In the budding stage, DNNs were first incorporated into CTR models around 2013. The rising stage saw the emergence of models such as Wide & Deep (WDL) and DeepFM, which combine explicit low‑order feature crosses with implicit high‑order interactions. The breakthrough stage introduced deeper and wider models such as DCN and xDeepFM, which explicitly model high‑order feature crosses using sophisticated deep‑learning components.
iQIYI’s baseline ranking model consists of a Wide side (FM) that receives GBDT‑generated features and a Deep side (DNN) that stacks on top of FM. This architecture suffers from three main drawbacks: (1) GBDT is a preprocessing component that cannot be updated in real time, preventing end‑to‑end training; (2) sparse and dense features are processed separately, and the model does not support explicit high‑order crossing of sparse features; (3) the overall structure is rigid and lacks flexibility for incorporating new ranking components.
Attempts to replace the baseline with more complex models such as DCN or xDeepFM revealed severe inference‑efficiency issues. On CPU, xDeepFM is 2.5–3.5 times slower than the baseline, and it only meets latency requirements on GPU with large batch sizes.
Knowledge distillation offers a solution by transferring the knowledge of a large, high‑performance model to a smaller, faster model.
Based on these observations, iQIYI defined three optimization directions: (1) upgrade the existing baseline ranking model; (2) enable large‑scale sparse feature crossing; (3) deploy high‑performance complex deep models in a practical manner.
The proposed dual‑DNN ranking model consists of two DNN‑based CTR models. The left DNN (teacher) is a complex model with superior performance but higher inference cost, while the right DNN (student) is a lightweight model designed for online inference. Both models share the same feature input and representation layers; the teacher additionally includes a Feature Interaction Layer that can host various feature‑cross components. The main layers are:
Embedding Layer: field‑wise average‑pooled embeddings for sparse features.
Feature Interaction Layer: core of the teacher, supporting both second‑order and higher‑order feature crosses.
Classifier Layer: multi‑layer DNN classifiers for each side.
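To make the representation and interaction layers concrete, here is a minimal, framework‑free sketch (plain Python, illustrative names only, not iQIYI's actual code): field‑wise average pooling of a multi‑valued sparse feature, and the classic FM identity for second‑order crosses that a Feature Interaction Layer can host.

```python
# Hypothetical sketch of the Embedding and Feature Interaction layers.

def field_avg_pool(value_embeddings):
    """Field-wise average pooling: a multi-valued sparse field (e.g. a user's
    watched tags) is reduced to a single vector by averaging its embeddings."""
    n = len(value_embeddings)
    dim = len(value_embeddings[0])
    return [sum(vec[d] for vec in value_embeddings) / n for d in range(dim)]

def fm_second_order(field_vectors):
    """FM-style second-order interaction over pooled field vectors, using
    sum_{i<j} <v_i, v_j> = 0.5 * (||sum_i v_i||^2 - sum_i ||v_i||^2),
    which avoids enumerating every field pair explicitly."""
    dim = len(field_vectors[0])
    total = 0.0
    for d in range(dim):
        s = sum(v[d] for v in field_vectors)       # (sum of vectors)[d]
        sq = sum(v[d] ** 2 for v in field_vectors)  # sum of squares at dim d
        total += s * s - sq
    return 0.5 * total
```

For two fields the identity reduces to a plain dot product: `fm_second_order([[1, 2], [3, 4]])` equals `<[1,2], [3,4]> = 11.0`.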
The advantages of the dual‑DNN framework are threefold: (1) Feature Transfer – the student reuses the teacher’s input‑representation layer (copy‑and‑freeze); (2) Knowledge Distillation on the fly – during joint training, the teacher’s predictions guide the student, eliminating the need for a separate distillation stage; (3) Classifier Transfer – the teacher’s hidden layers serve as supervision signals for the student’s hidden layers, improving student performance without affecting the teacher.
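A minimal sketch of how the joint (co‑training) objective could combine these signals for the student; the weights `alpha` and `beta` and all function names are illustrative assumptions, not iQIYI's implementation. In a real framework the teacher's prediction and hidden activations would be detached (stop‑gradient) so distillation never affects the teacher.

```python
import math

def bce(target, pred, eps=1e-7):
    # Binary cross-entropy; `target` may be a hard label (0/1) or the
    # teacher's soft prediction (on-the-fly distillation).
    pred = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))

def hint_loss(teacher_hidden, student_hidden):
    # Classifier Transfer: MSE between corresponding hidden activations.
    n = len(teacher_hidden)
    return sum((t - s) ** 2 for t, s in zip(teacher_hidden, student_hidden)) / n

def student_loss(label, student_pred, teacher_pred,
                 student_hidden, teacher_hidden, alpha=0.5, beta=0.1):
    # Hard-label loss + soft-label distillation + hidden-layer hints.
    # teacher_pred / teacher_hidden are treated as constants here.
    return (bce(label, student_pred)
            + alpha * bce(teacher_pred, student_pred)
            + beta * hint_loss(teacher_hidden, student_hidden))
```

With `alpha = beta = 0` this degenerates to ordinary supervised training of the student, which is a useful sanity check when tuning the distillation weights.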
Training architecture incorporates online learning: an offline dual‑DNN model is first trained on a 30‑day window, then fine‑tuned with the latest samples to capture recent user‑behavior patterns. Online learning uses the offline model as a warm‑start for the student, addressing OOV issues and preventing long‑term bias.
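One way the warm‑start could look in code, as a toy sketch with assumed names: the online student copies the offline model's embedding table, and any key unseen offline (OOV) falls back to a fresh initialization instead of failing the lookup.

```python
import random

class WarmStartEmbeddingTable:
    """Toy embedding table: warm-started from an offline checkpoint, with
    on-the-fly initialization for OOV keys that first appear in online traffic."""

    def __init__(self, offline_weights, dim, seed=0):
        self.table = dict(offline_weights)  # copy the offline model's rows
        self.dim = dim
        self.rng = random.Random(seed)

    def lookup(self, key):
        if key not in self.table:
            # OOV key: create a small random row once, then reuse it.
            self.table[key] = [self.rng.uniform(-0.01, 0.01)
                               for _ in range(self.dim)]
        return self.table[key]
```

Keys seen offline keep their trained vectors (the warm start), while new users or items get usable rows immediately, which is what lets online fine‑tuning track fresh behavior without retraining from scratch.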
For inference, the model graph and embedding weights are decoupled and deployed via TensorFlow Serving with custom ops that fetch embeddings from a distributed parameter‑server, simplifying updates of large‑scale sparse features.
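The decoupling can be mocked as follows; `ParameterServerClient` and its methods are hypothetical stand‑ins for the custom ops described above, not a real TensorFlow Serving API. The point of the design is that the dense graph stays static while the sparse embedding rows are fetched, and updated, independently.

```python
class ParameterServerClient:
    # Stand-in for a distributed parameter server holding sparse embeddings.
    def __init__(self, dim=2):
        self.store = {}
        self.dim = dim

    def push(self, key, vector):
        # The online learner streams updated rows here.
        self.store[key] = vector

    def batch_lookup(self, keys):
        # Custom-op-style fetch at inference time; unknown keys -> zero rows.
        return [self.store.get(k, [0.0] * self.dim) for k in keys]

def serve(ps, feature_keys, dense_forward):
    # Inference path: fetch embeddings remotely, then run the static dense graph.
    embeddings = ps.batch_lookup(feature_keys)
    flat = [x for vec in embeddings for x in vec]
    return dense_forward(flat)
```

Because the graph only sees the fetched vectors, refreshing large‑scale sparse features is a parameter‑server write, with no model redeployment involved.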
Experimental results in iQIYI’s information‑flow recommendation scenario show that the teacher’s inference latency is about five times that of the student, while the student’s model size is less than a third of the teacher’s. The student achieves higher QPS and better ROI under the same resource budget. Compared with a two‑stage training pipeline, the Co‑Train (joint‑training) mode further narrows the performance gap between teacher and student.
Related industry practices include Baidu’s CTR‑X and CTR‑3.0, which also employ teacher‑student training, and Alibaba’s Rocket Launching framework that jointly trains a light net and a booster net but differs in supervision strategy and embedding updates.
In summary, iQIYI’s online knowledge‑distillation method and dual‑DNN ranking model provide an effective way to balance model accuracy and latency. Future work will continue to push the boundaries of teacher network width and depth to further improve ranking precision.
iQIYI Technical Product Team