Online Deep Learning (ODL) for Real‑Time Advertising Effectiveness: Challenges and Solutions
iQIYI’s minute‑level online deep‑learning framework tackles stability, timeliness, compatibility, delayed feedback, catastrophic forgetting, and the i.i.d. assumption through high‑availability pipelines, TensorFlow Example serialization, rapid P2P model distribution, flexible scheduling, disaster‑recovery rollbacks, PU‑loss correction, and knowledge distillation, delivering a 6.2% revenue uplift.
In the context of performance advertising, media partners need to accurately measure the value of each request. Model predictions play a core role in ad bidding, and improving prediction accuracy directly drives higher monetization efficiency and ad revenue.
Previously, iQIYI’s advertising prediction models operated on an hourly basis, resulting in several hours of latency between ad delivery and feedback. In the second half of 2023, the team upgraded the system to a minute‑level online deep learning (ODL) framework, achieving a 6.2% revenue increase. Compared with offline hourly models, ODL introduces engineering and effectiveness challenges that are summarized and addressed in this article.
ODL Engineering Challenges
Stability: The streaming pipeline must be highly robust to avoid back‑pressure or interruptions.
Timeliness: Model updates on the inference side need to be performed with minimal delay.
Compatibility: The framework should flexibly support offline/online modes and different model types such as pCTR and pCVR.
Model Effectiveness Challenges
Real‑time sample feedback delay.
Catastrophic forgetting of previously learned knowledge.
Violation of the assumption that training samples are independent and identically distributed (i.i.d.), since streaming samples arrive time‑ordered.
Key Solutions
1. Service Robustness – The ODL pipeline processes samples every 5 minutes. A dual‑cluster high‑availability setup for the data‑lake feature snapshots ensures automatic failover when a cluster becomes unavailable.
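The failover behavior described above can be sketched in a few lines. This is a minimal illustration, not iQIYI's implementation: the cluster names, the `fetch` callback, and the retry/backoff parameters are all hypothetical.

```python
import time

def read_snapshot(clusters, fetch, retries=2, backoff=0.1):
    """Try each feature-snapshot cluster in priority order;
    fail over to the next cluster when one becomes unavailable."""
    last_err = None
    for cluster in clusters:
        for attempt in range(retries):
            try:
                return fetch(cluster)
            except ConnectionError as err:
                last_err = err
                time.sleep(backoff * (2 ** attempt))  # exponential backoff
    raise RuntimeError("all clusters unavailable") from last_err
```

With a primary and a backup cluster, a read that fails on the primary transparently returns data from the backup, which is the automatic-failover property the dual-cluster setup provides.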
2. Efficient Model Training – Initially, samples were serialized to JSON and sent to Kafka, causing heavy parsing overhead and limiting CPU utilization to 40%. By switching to TensorFlow Example serialization directly in Kafka, the consumption QPS increased tenfold, CPU usage rose to 100%, and the number of distributed training nodes was dramatically reduced.
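The gain here comes from replacing text parsing with a fixed binary layout. The production system uses TensorFlow Example protos in Kafka; the stdlib-only sketch below uses `struct` instead of protobuf to illustrate the same idea with a toy record (the field names and layout are assumptions for illustration).

```python
import json
import struct

# Toy sample layout: (user_id, ad_id, label, 4 dense features),
# fixed-width little-endian binary encoding.
RECORD = struct.Struct("<qqB4f")

def encode_binary(user_id, ad_id, label, feats):
    return RECORD.pack(user_id, ad_id, label, *feats)

def decode_binary(buf):
    user_id, ad_id, label, *feats = RECORD.unpack(buf)
    return user_id, ad_id, label, feats

def encode_json(user_id, ad_id, label, feats):
    # The original JSON path: larger payloads, slower parsing.
    return json.dumps({"u": user_id, "a": ad_id, "y": label, "f": feats}).encode()
```

Decoding a fixed binary record is a single memory copy with no tokenizing, which is why pre-serializing samples before they reach Kafka lets each consumer process far more messages per CPU cycle.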
3. Model Update Timeliness – The evaluator node was modified to export a new model every 10 minutes once batch‑size and time‑window requirements are met. Multi‑region deployment with parallel updates introduced a new bottleneck: thousands of containers downloading the model from S3 simultaneously. This was mitigated by enabling an icache P2P distribution layer within each region, relieving S3 pressure.
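Two ideas from this step can be sketched together: gating export on both a batch-size and a time-window condition, and fanning model downloads out through in-region peers so S3 is hit once per region rather than once per container. Thresholds, names, and the one-seed-per-region policy are illustrative assumptions, not iQIYI's actual icache design.

```python
def should_export(samples_since_export, seconds_since_export,
                  min_batch=100_000, min_window=600):
    """Export a new model only when both the batch-size and the
    time-window (10-minute) thresholds are met."""
    return samples_since_export >= min_batch and seconds_since_export >= min_window

def plan_downloads(containers_by_region):
    """One seed container per region pulls from S3; the remaining
    containers fetch from in-region peers, cutting S3 reads from
    the total container count down to the number of regions."""
    plan = {}
    for region, containers in containers_by_region.items():
        seed, peers = containers[0], containers[1:]
        plan[region] = {"from_s3": [seed], "from_peers": peers}
    return plan
```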
4. Framework Flexibility – The scheduling system can initialize the ODL model daily with parameters from a day‑level model, ensuring compatibility across day‑level, hour‑level, and online training. Custom attribution windows are supported to meet diverse model requirements.
5. Disaster Recovery – Continuous monitoring of service progress and online quality metrics (e.g., AUC, prediction bias) enables automatic rollback to the latest warm‑start version when anomalies are detected.
6. Addressing Delayed Feedback – Offline high‑quality day‑level samples are used as a baseline, while ODL samples wait for an attribution window before being labeled as positive or negative. A PU‑loss approach adjusts the cross‑entropy loss for samples that were initially labeled negative but later identified as positive.
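One common PU-style correction for delayed feedback, sketched below, trains a late-arriving conversion as a positive and subtracts the loss of its earlier, incorrect negative update. The article does not specify iQIYI's exact formulation, so treat this as one representative scheme rather than their implementation.

```python
import math

def bce(p, y):
    """Binary cross-entropy with clipping for numerical stability."""
    p = min(max(p, 1e-7), 1 - 1e-7)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def delayed_feedback_loss(p, label, was_fake_negative):
    """PU-style adjustment: a sample first ingested as a negative
    that later converts contributes the positive loss minus a term
    that cancels its earlier fake-negative contribution."""
    loss = bce(p, label)
    if was_fake_negative and label == 1:
        loss -= bce(p, 0)  # undo the earlier incorrect negative update
    return loss
```

Note that the corrected loss can go negative for an individual sample; in expectation over the stream it acts as a gradient correction rather than a per-sample objective.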
7. Catastrophic Forgetting – Real‑time samples may drift from the global distribution. Knowledge distillation is applied by adding a soft loss term that aligns ODL predictions with those of the stable day‑level model.
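The soft loss term can be written as a weighted mix of the usual hard cross-entropy and a cross-entropy against the teacher's prediction. The mixing weight `alpha` and the exact form of the soft term are assumptions; only the teacher/student roles come from the article.

```python
import math

def distill_loss(p_student, y, p_teacher, alpha=0.3):
    """Hard cross-entropy on the true label plus a soft term pulling
    the ODL (student) prediction toward the stable day-level
    (teacher) prediction, weighted by alpha."""
    eps = 1e-7
    ps = min(max(p_student, eps), 1 - eps)
    hard = -(y * math.log(ps) + (1 - y) * math.log(1 - ps))
    soft = -(p_teacher * math.log(ps) + (1 - p_teacher) * math.log(1 - ps))
    return (1 - alpha) * hard + alpha * soft
```

With `alpha = 0` this reduces to ordinary training; raising `alpha` increasingly anchors the online model to the day-level model's behavior, which is what limits drift on skewed real-time batches.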
8. i.i.d. Requirement – Hour‑level features are frozen during ODL training; the ODL model reuses the pretrained day‑level weights for the hour feature to preserve distributional properties.
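Freezing amounts to excluding the hour-level feature weights from the update step. A minimal sketch with a toy SGD update over named parameters (the parameter names are hypothetical):

```python
def sgd_step(params, grads, frozen, lr=0.01):
    """Apply SGD only to trainable parameters; frozen hour-level
    feature weights keep their pretrained day-level values."""
    return {
        name: (w if name in frozen else w - lr * grads[name])
        for name, w in params.items()
    }
```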
Overall, the ODL solution has upgraded iQIYI’s core sparse models (pCTR, pCVR) to online learning, improving model timeliness by more than tenfold and contributing to a noticeable revenue uplift. Future work includes extending the framework to incorporate behavior sequences and multimodal signals, which will bring new challenges in training efficiency and model updating.
iQIYI Technical Product Team