
Combining Knowledge Distillation, Exposure Forecasting, and Pacing to Guarantee Brand Exposure on Alibaba's Advertising Platform

Alibaba's advertising platform combines knowledge distillation to score traffic, GBDT‑based exposure forecasting, and PID‑based pacing to guarantee contracted impression volumes while improving CTR and CVR. The approach handles traffic selection and delayed exposure, and achieved near‑perfect delivery in large promotions.

Alimama Tech

This article describes how Alimama, Alibaba's external advertising platform, integrates knowledge distillation, exposure forecasting, and pacing to meet advertisers' guaranteed brand exposure ("保量", i.e. volume guarantee) requirements.

Background: Advertisers purchase contract ads to reach a target number of impressions. In the splash‑screen scenario, ads are pre‑loaded and displayed only on the N‑th app launch (N≥2). This creates two challenges for guaranteed exposure: a delivered ad may never be shown, and exposure can be delayed. Advertisers also want higher interaction metrics such as CTR and CVR.

Challenges:

Choosing which delivered traffic to fill with ads.

Delivered traffic does not guarantee exposure.

Exposure may be delayed.

Solution Overview: The platform controls two modules, return‑rate control and traffic delivery, while the media side decides actual exposure. The solution is divided into three parts:

3.1 Return‑Rate Control

Problem definition: Given a target return rate, the simplest method is to return each request at random with the target probability, but this ignores traffic quality. Instead, the platform defines a traffic‑value score and returns traffic whose score falls below a threshold. The score is derived from the gap between the actual and target return rates and from the estimated value of the traffic.
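The contrast between random return and score‑based return can be sketched as follows. This is a minimal illustration, not the platform's implementation: the function names are hypothetical, and picking the threshold as an empirical quantile of recent scores is an assumption made here so that the share of returned traffic roughly matches the target rate.

```python
import random
from typing import List


def random_return(target_rate: float, rng: random.Random) -> bool:
    """Baseline: return a request with probability target_rate, ignoring quality."""
    return rng.random() < target_rate


def quantile_threshold(scores: List[float], target_rate: float) -> float:
    """Assumed rule: the score below which about target_rate of the sample falls."""
    if not scores:
        raise ValueError("scores must not be empty")
    ordered = sorted(scores)
    k = int(target_rate * len(ordered))
    k = min(max(k, 0), len(ordered) - 1)
    return ordered[k]


def value_based_return(score: float, threshold: float) -> bool:
    """Return (do not fill) traffic whose value score is below the threshold."""
    return score < threshold


# Hypothetical value scores for eight requests.
sample_scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
threshold = quantile_threshold(sample_scores, target_rate=0.25)
returned = [s for s in sample_scores if value_based_return(s, threshold)]
```

With this rule the lowest‑value requests are the ones returned, whereas the random baseline would return high‑ and low‑value requests alike.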

Traffic‑value estimation: Scoring each request online after creative retrieval is too costly. Instead, a CTR‑based value score is precomputed offline and fetched from a KV table keyed by user_id and pid. Because the return decision is made before any creative information is available, knowledge distillation is applied: a high‑capacity teacher model with rich features (including creative features) predicts CTR, and its soft labels train a lightweight student model that uses only user and media identifiers.
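A minimal sketch of the lookup path, assuming the student model's scores have already been written offline into a table keyed by (user_id, pid). The in‑memory dict, the function name, and the fallback score for unseen keys are all assumptions for illustration; the article does not say which KV store is used or how missing keys are handled.

```python
from typing import Dict, Tuple

Key = Tuple[str, str]  # (user_id, pid)


def lookup_traffic_value(
    kv_table: Dict[Key, float],
    user_id: str,
    pid: str,
    default_score: float,
) -> float:
    """Fetch the precomputed CTR-based value score for a request.

    default_score is a hypothetical fallback for keys absent from the table.
    """
    return kv_table.get((user_id, pid), default_score)


# Hypothetical offline output of the student model.
kv_table: Dict[Key, float] = {
    ("user_1", "pid_a"): 0.042,
    ("user_2", "pid_a"): 0.011,
}

known = lookup_traffic_value(kv_table, "user_1", "pid_a", default_score=0.02)
unknown = lookup_traffic_value(kv_table, "user_9", "pid_b", default_score=0.02)
```

The point of the design is that the online path is a single key lookup, so no model inference happens at request time.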

The student model is trained jointly with the teacher. A temperature‑scaled softmax provides the soft labels, and a combined loss (a hard loss plus a soft loss, weighted by a control parameter r) mitigates the impact of an under‑fitted teacher.
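The combined loss can be illustrated with a small pure‑Python sketch. The article does not give the exact formula, so the form used here, hard loss + r × soft loss with cross‑entropy for both terms, is an assumption, as are the function names and the two‑class (no click, click) logits.

```python
import math
from typing import List


def softmax_t(logits: List[float], temperature: float) -> List[float]:
    """Temperature-scaled softmax; a larger temperature gives softer probabilities."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target: List[float], pred: List[float], eps: float = 1e-12) -> float:
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, pred))


def distillation_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    hard_label: int,
    r: float,
    temperature: float,
) -> float:
    """Assumed form: hard loss + r * soft loss."""
    one_hot = [1.0 if i == hard_label else 0.0 for i in range(len(student_logits))]
    hard = cross_entropy(one_hot, softmax_t(student_logits, 1.0))
    soft_labels = softmax_t(teacher_logits, temperature)
    soft = cross_entropy(soft_labels, softmax_t(student_logits, temperature))
    return hard + r * soft


# Two-class example: index 1 means "click".
loss_hard_only = distillation_loss([0.0, 0.0], [1.0, 2.0], hard_label=1, r=0.0, temperature=2.0)
loss_combined = distillation_loss([0.0, 0.0], [1.0, 2.0], hard_label=1, r=0.5, temperature=2.0)
```

Setting r to 0 reduces the objective to the hard loss alone, which shows how a small r limits the influence of a teacher that has not yet fitted well.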

3.1.3 Return‑Rate Adjustment Based on Traffic Value

The final traffic score is a function of the distilled CTR prediction and a PID‑based pacing factor. Every 5 minutes the platform compares the cumulative return rate with the target and scales the pacing factor accordingly. Experiments during the 618 promotion showed that the PID‑pacing algorithm kept exposure deviation within 2% of the goal.
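The periodic adjustment can be sketched with a textbook PID controller. Everything specific here is an assumption rather than the platform's actual design: the gains, the exponential update of the pacing factor, and the choice to multiply the distilled CTR by the factor to obtain the score.

```python
import math
from dataclasses import dataclass


@dataclass
class PID:
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, error: float) -> float:
        """One control step; called once per adjustment interval (e.g. 5 minutes)."""
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def update_pacing_factor(
    factor: float, target_rate: float, cumulative_rate: float, pid: PID
) -> float:
    """Assumed update: shrink the factor when too little traffic has been returned.

    A smaller factor lowers every score, so more traffic falls below the
    return threshold and the return rate rises toward the target.
    """
    error = target_rate - cumulative_rate
    return factor * math.exp(-pid.step(error))


def traffic_score(distilled_ctr: float, factor: float) -> float:
    """Assumed combination of the distilled CTR and the pacing factor."""
    return distilled_ctr * factor


# Under-returning (20% returned vs. a 30% target): the factor drops below 1.
factor_low = update_pacing_factor(1.0, 0.30, 0.20, PID(kp=1.0, ki=0.0, kd=0.0))
# Over-returning (40% returned vs. a 30% target): the factor rises above 1.
factor_high = update_pacing_factor(1.0, 0.30, 0.40, PID(kp=1.0, ki=0.0, kd=0.0))
```

The exponential form keeps the factor positive regardless of the controller output; an additive update with clipping would be an equally plausible choice.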

3.2 Traffic Delivery

The core objective is to guarantee the total exposure while releasing traffic evenly. Three modules are proposed:

Exposure forecasting – estimating potential future impressions.

Ad allocation – optimally distributing traffic among ads.

Vertical pacing – ensuring a smooth, uniform release.

3.3 Exposure Forecasting

At any moment the system knows the amount of traffic already delivered and the amount already exposed. Using this information, a model predicts the remaining potential impressions. The problem is treated as a regression (or time‑series) task; GBDT is chosen as the baseline due to its balance of accuracy and efficiency.

Features include time‑of‑day, historical traffic patterns, and recent delivery statistics. Training data are taken from large‑scale promotion periods. Offline validation across multiple media yields an average MAPE of 0.14; online results for three media (A, B, C) show MAPE values of 0.1167, 0.1432, and 0.1691 respectively.
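For reference, MAPE (mean absolute percentage error) is the average of |actual − forecast| / |actual| over all evaluation points, so a value of 0.14 corresponds to an average error of 14%. A minimal sketch with made‑up numbers (the impression counts below are hypothetical, not data from the article):

```python
from typing import Sequence


def mape(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Mean absolute percentage error, returned as a fraction (0.14 means 14%)."""
    if len(actual) != len(forecast) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    errors = [abs(a - f) / abs(a) for a, f in zip(actual, forecast)]
    return sum(errors) / len(errors)


# Hypothetical actual vs. forecast impressions for two time windows.
example_mape = mape([100.0, 200.0], [110.0, 180.0])
```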

3.4 Exposure Allocation & Pacing

After obtaining a media‑level exposure forecast, the total forecast is distributed to individual ads through a three‑step process:

Allocate the media forecast to each feature according to the feature importance derived from the GBDT model.

Distribute each feature’s exposure to ads based on the normalized weight of the ad on that feature.

Sum across all features to obtain the final potential exposure for each ad.
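The three steps above can be sketched as a single function. The feature names, ad names, and weights are hypothetical, and the article does not specify how an ad's weight on a feature is computed, so the inputs below are placeholders.

```python
from typing import Dict


def allocate_exposure(
    media_forecast: float,
    feature_importance: Dict[str, float],
    ad_feature_weight: Dict[str, Dict[str, float]],
) -> Dict[str, float]:
    """Split a media-level exposure forecast across ads.

    Step 1: split the forecast across features by normalized importance.
    Step 2: split each feature's share across ads by normalized ad weight.
    Step 3: sum each ad's shares over all features.
    """
    total_importance = sum(feature_importance.values())
    if total_importance <= 0:
        raise ValueError("feature importances must sum to a positive value")
    per_ad = {ad: 0.0 for ad in ad_feature_weight}
    for feature, importance in feature_importance.items():
        feature_exposure = media_forecast * importance / total_importance
        weight_sum = sum(w.get(feature, 0.0) for w in ad_feature_weight.values())
        if weight_sum == 0:
            continue  # no ad carries weight on this feature; its share is unallocated
        for ad, weights in ad_feature_weight.items():
            per_ad[ad] += feature_exposure * weights.get(feature, 0.0) / weight_sum
    return per_ad


allocation = allocate_exposure(
    media_forecast=1000.0,
    feature_importance={"hour_of_day": 0.6, "recent_delivery": 0.4},
    ad_feature_weight={
        "ad_a": {"hour_of_day": 1.0, "recent_delivery": 3.0},
        "ad_b": {"hour_of_day": 1.0, "recent_delivery": 1.0},
    },
)
```

In this example the hour_of_day share (600) is split evenly and the recent_delivery share (400) is split 3:1, so ad_a receives about 600 and ad_b about 400 of the 1000 forecast impressions.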

During delivery, each ad’s weight is the product of its completion rate and its allocated exposure. A PID controller adjusts the pacing every 5 minutes to keep all ads’ completion rates aligned, achieving both volume guarantees and interaction‑metric improvements.

Results & Outlook: Despite large variance in delayed exposure across media, the combined forecasting‑pacing system achieved 100% exposure completion in both the pre‑promotion and core‑promotion phases of the latest large‑scale campaign, exceeding the guaranteed volume. Future work includes exploring DeepAR, reinforcement learning, and other advanced techniques to better balance volume guarantees with user‑brand interaction metrics.

Alibaba, CTR prediction, machine learning, knowledge distillation, advertising, pacing, exposure forecasting