Multimodal Evolution and Application in Tencent Game Advertising System
This article describes the end‑to‑end multimodal modeling pipeline—covering text, image, and video understanding, model evolution from shallow to deep networks, key‑frame extraction, fine‑tuning, and multimodal fusion—used in Tencent's game ad exchange platform, along with practical deployment challenges and solutions.
System Business Overview
Our business focuses on advertising for Tencent games, operating as an ADX (Ad Exchange) platform that connects external media with game‑studio advertisers through real‑time bidding. Media send requests to the SSP, which filters fraudulent or device‑less traffic and forwards the cleaned traffic to the DSP. The DSP ranks candidate ads based on estimated revenue and returns the most profitable creative and bid to the media. Advertisers use the DMP to define target audiences and configure placement parameters such as budget, bidding strategy, and creative type (image or video).
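The DSP's "rank by estimated revenue" step can be sketched as a simple eCPM sort. This is a minimal illustration, not the production ranking model; the names (`Candidate`, `pctr`, `bid_cpc`) are hypothetical.

```python
# Hypothetical sketch of DSP fine ranking by expected revenue (eCPM).
from dataclasses import dataclass

@dataclass
class Candidate:
    creative_id: str
    pctr: float      # predicted click-through rate
    bid_cpc: float   # advertiser's cost-per-click bid

def rank_by_ecpm(candidates):
    """Sort candidates by expected revenue per thousand impressions."""
    return sorted(candidates, key=lambda c: c.pctr * c.bid_cpc * 1000, reverse=True)

ads = [Candidate("a", 0.02, 1.5), Candidate("b", 0.05, 0.8), Candidate("c", 0.01, 3.0)]
best = rank_by_ecpm(ads)[0]
print(best.creative_id)  # the highest-eCPM creative is returned to the media
```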
Advertisers can obtain user behaviors such as clicks, game downloads, registrations, and recharges. Content understanding can be mined from two aspects: the content itself (titles, video descriptions, raw image/audio) and user behavior. By constructing graph sequences from content attributes and combining them with behavior sequences, we can improve cold‑start performance and capture user intent.
Multimodal Evolution Timeline
Stage 1: TimeSformer
TimeSformer (Time‑Space Transformer) is a video understanding model open‑sourced by Facebook AI that achieves state‑of‑the‑art results on several benchmarks. Unlike CNN‑based video classifiers, it uses a transformer to capture long‑range spatio‑temporal dependencies by treating a video as a sequence of image patches.
In practice, we treat image ads as single‑frame videos (spatial attention only) and embed video ads with the full TimeSformer. However, applying the open‑source model directly to our image data did not improve AUC: our training data are much simpler than the model's original video datasets, which biases the extracted features.
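The "video as a sequence of patches" idea, and why an image ad is just the single‑frame special case, can be sketched in NumPy. This is an illustrative patchifier, not the real model's input pipeline; the 16‑pixel patch size matches common ViT/TimeSformer defaults.

```python
import numpy as np

def video_to_patches(frames, patch=16):
    """Split a (T, H, W, C) clip into the flat patch sequence a
    TimeSformer-style model attends over; an image ad is simply the
    T = 1 case, where only spatial attention applies."""
    t, h, w, c = frames.shape
    return (frames
            .reshape(t, h // patch, patch, w // patch, patch, c)
            .transpose(0, 1, 3, 2, 4, 5)
            .reshape(t * (h // patch) * (w // patch), patch * patch * c))

image_ad = np.zeros((1, 224, 224, 3))   # single-frame "video"
video_ad = np.zeros((8, 224, 224, 3))   # 8 sampled frames
print(video_to_patches(image_ad).shape)  # (196, 768)
print(video_to_patches(video_ad).shape)  # (1568, 768)
```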
Stage 2: Native CNN Networks
After reflecting on Stage 1, we optimized in two dimensions: (1) start from proven industrial CNNs; (2) analyze video ad content and find that the latter part of a video often carries the advertising intent, which strongly influences user behavior.
2.1 CNN Selection
We evaluated a range of CNNs (AlexNet, RCNN, VGG, Inception, ResNet, etc.) using weights pretrained on ImageNet. NASNetLarge and Inception‑ResNet‑v2 achieved the two highest accuracies, but NASNetLarge has roughly 60 % more parameters. Weighing accuracy against resource cost, we use the second‑to‑last layer of Inception‑ResNet‑v2 as our image embedding extractor.
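Using a pretrained classifier as an embedding extractor amounts to running every layer except the classification head and keeping the penultimate activation. The sketch below shows the idea with stand‑in layer functions; the real backbone is Inception‑ResNet‑v2, not these toy matrices.

```python
# Minimal sketch of embedding extraction: forward through all layers except
# the final softmax head and return the second-to-last layer's output.
# The layer functions below are hypothetical stand-ins for real CNN layers.
import numpy as np

def embed(x, layers):
    """Run every layer but the classification head."""
    for layer in layers[:-1]:          # drop the final softmax layer
        x = layer(x)
    return x                           # penultimate activation = embedding

backbone = [
    lambda x: np.maximum(x @ (np.ones((3, 8)) * 0.1), 0),  # conv/pool stand-in
    lambda x: np.maximum(x @ (np.ones((8, 4)) * 0.1), 0),  # penultimate layer
    lambda x: np.exp(x) / np.exp(x).sum(),                 # softmax head (skipped)
]
vec = embed(np.ones((1, 3)), backbone)
print(vec.shape)  # (1, 4): an embedding, not class probabilities
```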
2.2 Key‑frame Extraction
For video ads we need representative frames. Simple strategies (cover image, random frame, last frame) are insufficient because the advertising intent usually appears in the latter half of the video. We therefore extract key clips based on softmax confidence or LDA‑based scoring, and use the last P‑frame as the overall video representation, which yields modest AUC and click‑through improvements.
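Since the advertising intent tends to sit in the latter half of the video, a lightweight engineering route is to let ffmpeg select key frames after the midpoint. The command below is a hedged sketch (paths and the start offset are illustrative); the `select` filter can target I‑ or P‑frames via `pict_type`, whereas the text above describes keeping the last P‑frame.

```python
# Build an ffmpeg command that keeps only intra-coded (I) frames after a
# given start offset; -vsync vfr avoids duplicating frames at the output rate.
def keyframe_cmd(src, out_pattern, start_sec):
    return [
        "ffmpeg", "-ss", str(start_sec), "-i", src,
        "-vf", "select='eq(pict_type,I)'",
        "-vsync", "vfr", out_pattern,
    ]

cmd = keyframe_cmd("ad.mp4", "frames/%03d.jpg", 15)  # second half of a 30 s ad
# subprocess.run(cmd, check=True)  # uncomment when ffmpeg is available
print(" ".join(cmd))
```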
Stage 3: CNN Fine‑Tuning
Open‑source CNNs are pretrained on generic natural images and do not generalize well to our game‑ad domain. Fine‑tuning a pretrained model on our limited dataset (~6 k samples) gave only marginal AUC gains because the sample size was too small. We addressed this by (1) data augmentation (flipping, rotation, cropping, scaling, Gaussian noise, and Conditional‑GAN‑based domain transfer), expanding the training set to ~60 k images, and (2) enlarging the dataset with additional video‑derived frames and external game‑ad images, reaching >300 k samples, which raised validation accuracy above 0.84.
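The geometric and noise augmentations listed above can be sketched with NumPy alone. This is a minimal illustration; the production pipeline also used Conditional‑GAN domain transfer, which is not shown here.

```python
import numpy as np

def augment(img, rng):
    """Return simple variants of an (H, W, C) image: flip, rotate, noise."""
    return [
        np.flip(img, axis=1),                                  # horizontal flip
        np.rot90(img, k=1, axes=(0, 1)),                       # 90-degree rotation
        np.clip(img + rng.normal(0, 10, img.shape), 0, 255),   # Gaussian noise
    ]

rng = np.random.default_rng(0)
img = np.full((224, 224, 3), 128.0)
out = augment(img, rng)
print(len(out), out[0].shape)  # 3 (224, 224, 3)
```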
Stage 4: Text Classification Modeling
Textual information (creative titles, descriptions, tags) complements visual features. Traditional shallow models (FastText) provide comparable performance to deep networks with far less training time. FastText incorporates n‑gram features to preserve word order. Using FastText embeddings in the ranking model increased offline AUC by 0.0011.
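The n‑gram trick FastText uses to preserve local word order can be shown in a few lines. This is a conceptual sketch; the real library hashes these n‑grams into a fixed‑size bucket space rather than keeping them as strings, and the example title is invented.

```python
# FastText-style word n-grams: augment the unigram bag with bigrams so that
# local word order survives the averaging step.
def ngrams(tokens, n=2):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

title = "new legend mobile game download now".split()
features = title + ngrams(title, 2)   # unigrams + bigrams
print(features[-1])  # 'download now'
```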
Stage 5: Multimodal Fusion
We explore early, late, and intermediate fusion strategies. Early fusion concatenates modality‑level features before classification, often requiring dimensionality reduction (PCA, auto‑encoders). Late fusion combines modality‑specific model outputs via averaging, max‑pooling, Bayesian rules, or ensemble methods. Intermediate fusion (e.g., bilinear pooling, attention‑based mechanisms) allows richer cross‑modal interactions. We adopt a ViLBERT‑style co‑Transformer with query‑only learnable parameters to keep inference latency low; this improves AUC by 0.003.
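Two of the baselines above can be sketched in NumPy: early fusion is plain concatenation, and an attention‑weighted combination stands in for intermediate fusion. The scoring function here is a toy (softmax over modality norms), not the ViLBERT‑style co‑Transformer used in production.

```python
import numpy as np

def early_fusion(img_emb, txt_emb):
    """Concatenate modality embeddings before the classifier."""
    return np.concatenate([img_emb, txt_emb], axis=-1)

def attention_fusion(img_emb, txt_emb):
    """Toy intermediate fusion: softmax-weight each modality by its norm."""
    scores = np.array([np.linalg.norm(img_emb), np.linalg.norm(txt_emb)])
    w = np.exp(scores) / np.exp(scores).sum()
    return w[0] * img_emb + w[1] * txt_emb   # modalities must share a dim

img, txt = np.ones(4), np.full(4, 2.0)
print(early_fusion(img, txt).shape)      # (8,)
print(attention_fusion(img, txt).shape)  # (4,)
```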
Online System Deployment
Media send requests via SDK/API to the SSP, which applies anti‑fraud filtering and forwards the traffic to the DSP. The DSP retrieves user features, then performs recall, coarse ranking, and fine ranking. Feature concatenation involves fetching creative‑level ad data and user profiles from Redis, then transforming and validating them before feeding them into the model.
Multimodal Representation Usage
Multimodal embeddings can be generated online (real‑time inference) or offline (batch generation). Real‑time inference provides up‑to‑date features but adds latency due to large model size. Offline generation stores embeddings in a repository and retrieves them during serving, but this increases storage pressure and can cause feature‑concatenation failures.
To mitigate resource pressure, we embed multimodal representations into a hash‑table (O(1) lookup) within the model, converting string embeddings to tensors on‑the‑fly, incurring an additional 3‑5 ms latency. This approach is now used in production.
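The in‑model hash‑table lookup can be sketched as a dict keyed by creative ID, with on‑the‑fly string‑to‑tensor parsing and a default vector for missing keys. Names and values here are illustrative; the production table lives inside the serving graph.

```python
# Sketch of the O(1) embedding lookup with string -> tensor conversion.
import numpy as np

table = {
    "creative_123": "0.12,0.05,-0.33,0.41",
    "creative_456": "0.07,-0.18,0.22,0.10",
}
DEFAULT = "0,0,0,0"   # fallback so feature concatenation never fails

def lookup(creative_id):
    """O(1) hash lookup, then parse the stored string into a vector."""
    raw = table.get(creative_id, DEFAULT)
    return np.array([float(x) for x in raw.split(",")])

print(lookup("creative_123"))  # stored embedding
print(lookup("creative_999"))  # unseen creative falls back to zeros
```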
TP99 Latency Spike
After embedding multimodal features, TP99 latency spikes became more pronounced due to TensorFlow’s lazy graph loading. Warm‑up inference with sample inputs during model deployment reduced the spike dramatically.
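The warm‑up fix amounts to paying the one‑time graph‑construction cost at deployment rather than on live traffic: run a few dummy inferences right after loading the model. `fake_predict` below is a hypothetical stand‑in for the real TensorFlow serving call.

```python
import time

def warm_up(predict_fn, dummy_input, rounds=3):
    """Run a few throwaway inferences after model load; the first call
    typically absorbs the lazy graph-construction latency."""
    latencies = []
    for _ in range(rounds):
        t0 = time.perf_counter()
        predict_fn(dummy_input)
        latencies.append(time.perf_counter() - t0)
    return latencies

def fake_predict(x):        # hypothetical stand-in for the served model
    return sum(x)

lats = warm_up(fake_predict, [0.0] * 8)
print(len(lats))  # 3
```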
Practical Takeaways
State‑of‑the‑art models may not transfer directly to a new domain; start with classic baselines and iterate.
When engineering solutions exist (e.g., using ffmpeg to extract key frames), prefer them over complex algorithmic pipelines.
Embedding multimodal features efficiently while keeping latency low remains an open challenge; community collaboration is encouraged.
IEG Growth Platform Technology Team
Official account of Tencent IEG Growth Platform Technology Team, showcasing cutting‑edge achievements across front‑end, back‑end, client, algorithm, testing and other domains.