Multimodal Evolution and Application in Tencent Game Advertising System
This article describes the end‑to‑end multimodal modeling pipeline—covering text, image, and video understanding, model evolution from shallow to deep networks, key‑frame extraction, fine‑tuning, and multimodal fusion—used in Tencent's game ad exchange platform, along with practical deployment challenges and solutions.
System Business Overview
Our business focuses on advertising for Tencent games, operating as an ADX (Ad Exchange) platform that connects external media with game‑studio advertisers through real‑time bidding. Media send requests to the SSP, which filters fraudulent or device‑less traffic and forwards the cleaned traffic to the DSP. The DSP ranks candidate ads based on estimated revenue and returns the most profitable creative and bid to the media. Advertisers use the DMP to define target audiences and configure placement parameters such as budget, bidding strategy, and creative type (image or video).
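The DSP's "rank by estimated revenue" step can be sketched as a simple eCPM sort. This is a minimal illustration, not the production ranking model; the names (`Candidate`, `pctr`, `bid_cpc`) are hypothetical.

```python
# Hypothetical sketch of DSP fine ranking by expected revenue (eCPM).
from dataclasses import dataclass

@dataclass
class Candidate:
    creative_id: str
    pctr: float      # predicted click-through rate
    bid_cpc: float   # advertiser's cost-per-click bid

def rank_by_ecpm(candidates):
    """Sort candidates by expected revenue per thousand impressions."""
    return sorted(candidates, key=lambda c: c.pctr * c.bid_cpc * 1000, reverse=True)

ads = [Candidate("a", 0.02, 1.5), Candidate("b", 0.05, 0.8), Candidate("c", 0.01, 3.0)]
best = rank_by_ecpm(ads)[0]
print(best.creative_id)  # the highest-eCPM creative is returned to the media
```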
Advertisers can obtain user behaviors such as clicks, game downloads, registrations, and recharges. Content understanding can be mined from two aspects: the content itself (titles, video descriptions, raw image/audio) and user behavior. By constructing graph sequences from content attributes and combining them with behavior sequences, we can improve cold‑start performance and capture user intent.
Multimodal Evolution Timeline
Stage 1: TimeSformer
TimeSformer (Time‑Space Transformer) is a video understanding model open‑sourced by Facebook AI that achieves state‑of‑the‑art results on several benchmarks. Unlike CNN‑based video classifiers, it uses a transformer to capture long‑range spatio‑temporal dependencies by treating a video as a sequence of image patches.
In practice, we treat image ads as single‑frame videos (spatial attention only) and embed video ads with the full TimeSformer. However, applying the open‑source model directly to our image data did not improve AUC: our training data are much simpler than the model's original video datasets, which biases the extracted features.
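The "video as a sequence of patches" idea, and why an image ad is just the single‑frame special case, can be sketched in NumPy. This is an illustrative patchifier, not the real model's input pipeline; the 16‑pixel patch size matches common ViT/TimeSformer defaults.

```python
import numpy as np

def video_to_patches(frames, patch=16):
    """Split a (T, H, W, C) clip into the flat patch sequence a
    TimeSformer-style model attends over; an image ad is simply the
    T = 1 case, where only spatial attention applies."""
    t, h, w, c = frames.shape
    return (frames
            .reshape(t, h // patch, patch, w // patch, patch, c)
            .transpose(0, 1, 3, 2, 4, 5)
            .reshape(t * (h // patch) * (w // patch), patch * patch * c))

image_ad = np.zeros((1, 224, 224, 3))   # single-frame "video"
video_ad = np.zeros((8, 224, 224, 3))   # 8 sampled frames
print(video_to_patches(image_ad).shape)  # (196, 768)
print(video_to_patches(video_ad).shape)  # (1568, 768)
```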
Stage 2: Native CNN Networks
After reflecting on Stage 1, we optimized in two dimensions: (1) start from proven industrial CNNs; (2) analyze video ad content and find that the latter part of a video often carries the advertising intent, which strongly influences user behavior.
2.1 CNN Selection
We evaluated a range of CNNs (AlexNet, RCNN, VGG, Inception, ResNet, etc.) using weights pretrained on ImageNet. NASNetLarge and Inception‑ResNet‑v2 achieved the two highest accuracies, but NASNetLarge has roughly 60 % more parameters. Weighing accuracy against resource cost, we use the second‑to‑last layer of Inception‑ResNet‑v2 as our image embedding extractor.
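Using a pretrained classifier as an embedding extractor amounts to running every layer except the classification head and keeping the penultimate activation. The sketch below shows the idea with stand‑in layer functions; the real backbone is Inception‑ResNet‑v2, not these toy matrices.

```python
# Minimal sketch of embedding extraction: forward through all layers except
# the final softmax head and return the second-to-last layer's output.
# The layer functions below are hypothetical stand-ins for real CNN layers.
import numpy as np

def embed(x, layers):
    """Run every layer but the classification head."""
    for layer in layers[:-1]:          # drop the final softmax layer
        x = layer(x)
    return x                           # penultimate activation = embedding

backbone = [
    lambda x: np.maximum(x @ (np.ones((3, 8)) * 0.1), 0),  # conv/pool stand-in
    lambda x: np.maximum(x @ (np.ones((8, 4)) * 0.1), 0),  # penultimate layer
    lambda x: np.exp(x) / np.exp(x).sum(),                 # softmax head (skipped)
]
vec = embed(np.ones((1, 3)), backbone)
print(vec.shape)  # (1, 4): an embedding, not class probabilities
```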
2.2 Key‑frame Extraction
For video ads we need representative frames. Simple strategies (cover image, random frame, last frame) are insufficient because the advertising intent usually appears in the latter half of the video. We therefore extract key clips based on softmax confidence or LDA‑based scoring, and use the last P‑frame as the overall video representation, which yields modest AUC and click‑through improvements.
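Since the advertising intent tends to sit in the latter half of the video, a lightweight engineering route is to let ffmpeg select key frames after the midpoint. The command below is a hedged sketch (paths and the start offset are illustrative); the `select` filter can target I‑ or P‑frames via `pict_type`, whereas the text above describes keeping the last P‑frame.

```python
# Build an ffmpeg command that keeps only intra-coded (I) frames after a
# given start offset; -vsync vfr avoids duplicating frames at the output rate.
def keyframe_cmd(src, out_pattern, start_sec):
    return [
        "ffmpeg", "-ss", str(start_sec), "-i", src,
        "-vf", "select='eq(pict_type,I)'",
        "-vsync", "vfr", out_pattern,
    ]

cmd = keyframe_cmd("ad.mp4", "frames/%03d.jpg", 15)  # second half of a 30 s ad
# subprocess.run(cmd, check=True)  # uncomment when ffmpeg is available
print(" ".join(cmd))
```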
Stage 3: CNN Fine‑Tuning
Open‑source CNNs are pretrained on generic natural images and do not generalize well to our game‑ad domain. Fine‑tuning a pretrained model on our limited dataset (~6 k samples) gave only marginal AUC gains because the sample size was too small. We addressed this by (1) data augmentation (flipping, rotation, cropping, scaling, Gaussian noise, and Conditional‑GAN‑based domain transfer), expanding the training set to ~60 k images, and (2) enlarging the dataset with additional video‑derived frames and external game‑ad images, reaching >300 k samples, which raised validation accuracy above 0.84.
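The geometric and noise augmentations listed above can be sketched with NumPy alone. This is a minimal illustration; the production pipeline also used Conditional‑GAN domain transfer, which is not shown here.

```python
import numpy as np

def augment(img, rng):
    """Return simple variants of an (H, W, C) image: flip, rotate, noise."""
    return [
        np.flip(img, axis=1),                                  # horizontal flip
        np.rot90(img, k=1, axes=(0, 1)),                       # 90-degree rotation
        np.clip(img + rng.normal(0, 10, img.shape), 0, 255),   # Gaussian noise
    ]

rng = np.random.default_rng(0)
img = np.full((224, 224, 3), 128.0)
out = augment(img, rng)
print(len(out), out[0].shape)  # 3 (224, 224, 3)
```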
Stage 4: Text Classification Modeling
Textual information (creative titles, descriptions, tags) complements visual features. Traditional shallow models (FastText) provide comparable performance to deep networks with far less training time. FastText incorporates n‑gram features to preserve word order. Using FastText embeddings in the ranking model increased offline AUC by 0.0011.
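The n‑gram trick FastText uses to preserve local word order can be shown in a few lines. This is a conceptual sketch; the real library hashes these n‑grams into a fixed‑size bucket space rather than keeping them as strings, and the example title is invented.

```python
# FastText-style word n-grams: augment the unigram bag with bigrams so that
# local word order survives the averaging step.
def ngrams(tokens, n=2):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

title = "new legend mobile game download now".split()
features = title + ngrams(title, 2)   # unigrams + bigrams
print(features[-1])  # 'download now'
```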
Stage 5: Multimodal Fusion
We explore early, late, and intermediate fusion strategies. Early fusion concatenates modality‑level features before classification, often requiring dimensionality reduction (PCA, auto‑encoders). Late fusion combines modality‑specific model outputs via averaging, max‑pooling, Bayesian rules, or ensemble methods. Intermediate fusion (e.g., bilinear pooling, attention‑based mechanisms) allows richer cross‑modal interactions. We adopt a ViLBERT‑style co‑Transformer with query‑only learnable parameters to keep inference latency low; this improves AUC by 0.003.
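Two of the baselines above can be sketched in NumPy: early fusion is plain concatenation, and an attention‑weighted combination stands in for intermediate fusion. The scoring function here is a toy (softmax over modality norms), not the ViLBERT‑style co‑Transformer used in production.

```python
import numpy as np

def early_fusion(img_emb, txt_emb):
    """Concatenate modality embeddings before the classifier."""
    return np.concatenate([img_emb, txt_emb], axis=-1)

def attention_fusion(img_emb, txt_emb):
    """Toy intermediate fusion: softmax-weight each modality by its norm."""
    scores = np.array([np.linalg.norm(img_emb), np.linalg.norm(txt_emb)])
    w = np.exp(scores) / np.exp(scores).sum()
    return w[0] * img_emb + w[1] * txt_emb   # modalities must share a dim

img, txt = np.ones(4), np.full(4, 2.0)
print(early_fusion(img, txt).shape)      # (8,)
print(attention_fusion(img, txt).shape)  # (4,)
```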
Online System Deployment
Media send requests via SDK/API to the SSP, which applies anti‑fraud filtering and forwards the traffic to the DSP. The DSP retrieves user features, then performs recall, coarse ranking, and fine ranking. Feature concatenation involves fetching creative‑level ad data and user profiles from Redis, then transforming and validating them before feeding them into the model.
Multimodal Representation Usage
Multimodal embeddings can be generated online (real‑time inference) or offline (batch generation). Real‑time inference provides up‑to‑date features but adds latency due to large model size. Offline generation stores embeddings in a repository and retrieves them during serving, but this increases storage pressure and can cause feature‑concatenation failures.
To mitigate resource pressure, we embed multimodal representations into a hash‑table (O(1) lookup) within the model, converting string embeddings to tensors on‑the‑fly, incurring an additional 3‑5 ms latency. This approach is now used in production.
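The in‑model hash‑table lookup can be sketched as a dict keyed by creative ID, with on‑the‑fly string‑to‑tensor parsing and a default vector for missing keys. Names and values here are illustrative; the production table lives inside the serving graph.

```python
# Sketch of the O(1) embedding lookup with string -> tensor conversion.
import numpy as np

table = {
    "creative_123": "0.12,0.05,-0.33,0.41",
    "creative_456": "0.07,-0.18,0.22,0.10",
}
DEFAULT = "0,0,0,0"   # fallback so feature concatenation never fails

def lookup(creative_id):
    """O(1) hash lookup, then parse the stored string into a vector."""
    raw = table.get(creative_id, DEFAULT)
    return np.array([float(x) for x in raw.split(",")])

print(lookup("creative_123"))  # stored embedding
print(lookup("creative_999"))  # unseen creative falls back to zeros
```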
TP99 Latency Spike
After embedding multimodal features, TP99 latency spikes became more pronounced due to TensorFlow’s lazy graph loading. Warm‑up inference with sample inputs during model deployment reduced the spike dramatically.
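The warm‑up fix amounts to paying the one‑time graph‑construction cost at deployment rather than on live traffic: run a few dummy inferences right after loading the model. `fake_predict` below is a hypothetical stand‑in for the real TensorFlow serving call.

```python
import time

def warm_up(predict_fn, dummy_input, rounds=3):
    """Run a few throwaway inferences after model load; the first call
    typically absorbs the lazy graph-construction latency."""
    latencies = []
    for _ in range(rounds):
        t0 = time.perf_counter()
        predict_fn(dummy_input)
        latencies.append(time.perf_counter() - t0)
    return latencies

def fake_predict(x):        # hypothetical stand-in for the served model
    return sum(x)

lats = warm_up(fake_predict, [0.0] * 8)
print(len(lats))  # 3
```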
Practical Takeaways
State‑of‑the‑art models may not transfer directly to a new domain; start with classic baselines and iterate.
When engineering solutions exist (e.g., using ffmpeg to extract key frames), prefer them over complex algorithmic pipelines.
Embedding multimodal features efficiently while keeping latency low remains an open challenge; community collaboration is encouraged.
IEG Growth Platform Technology Team
Official account of Tencent IEG Growth Platform Technology Team, showcasing cutting‑edge achievements across front‑end, back‑end, client, algorithm, testing and other domains.