
Multimodal Video Ad Second-Level Parsing: Algorithm Design and Baseline Analysis for the 2021 Tencent Advertising Algorithm Competition

This article details the algorithmic framework and baseline models for the 2021 Tencent Advertising Algorithm Competition, focusing on multimodal video ad parsing through temporal localization, scene segmentation, and multi-label classification to enhance advertising effectiveness and creative analysis.

Tencent Advertising Technology

This article outlines the technical framework and baseline methodologies for the 2021 Tencent Advertising Algorithm Competition, focusing on the novel task of multimodal video ad second-level parsing. Video advertisements are characterized by short durations, fast pacing, and high information density, making granular semantic understanding crucial for creative analysis, automated production, and recommendation systems. The core commercial logic relies on the 4T framework: Attraction, Trust, Persuasion, and Action, which necessitates precise temporal and multimodal feature extraction.

To address this, the competition abstracts the business problem into a unified AI research task combining temporal localization, multimodal representation learning, and structured video tagging. The authors analyze two primary algorithmic routes: multi-stage pipelines and end-to-end architectures. While multi-stage methods offer higher reliability and interpretability through sequential shot boundary detection, scene aggregation, and classification, end-to-end models provide streamlined processing but face challenges with modality constraints and long-tail label distributions. Consequently, a multi-stage baseline is adopted for its robustness.
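The multi-stage route can be summarized as three sequential stages: shot boundary detection, shot-to-scene aggregation, and per-scene classification. The following is a minimal toy sketch of that control flow, not the competition baseline itself; the scalar "frame features", the difference threshold, and both function names are illustrative assumptions standing in for TransNet v2 and LGSS.

```python
from typing import List, Tuple

def detect_shots(frames: List[float], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Toy stand-in for shot boundary detection (stage 1): cut wherever
    successive frame features differ by more than `threshold`."""
    cuts = [0]
    for i in range(1, len(frames)):
        if abs(frames[i] - frames[i - 1]) > threshold:
            cuts.append(i)
    cuts.append(len(frames))
    return [(cuts[i], cuts[i + 1]) for i in range(len(cuts) - 1)]

def merge_shots_to_scenes(shots: List[Tuple[int, int]],
                          min_len: int = 2) -> List[Tuple[int, int]]:
    """Toy stand-in for local-to-global scene aggregation (stage 2):
    fold shots shorter than `min_len` frames into their predecessor."""
    scenes: List[Tuple[int, int]] = []
    for start, end in shots:
        if scenes and end - start < min_len:
            scenes[-1] = (scenes[-1][0], end)
        else:
            scenes.append((start, end))
    return scenes
```

Stage 3 (multi-label classification of each resulting scene) is where the multimodal fusion network described below plugs in; keeping the stages separate is exactly what gives the multi-stage route its interpretability, since each intermediate output can be inspected and tuned independently.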

The baseline pipeline integrates TransNet v2 for accurate shot boundary detection, LGSS for local-to-global scene segmentation, and a multimodal classification network leveraging NextVLAD, BERT, and VGGish to fuse visual, textual, and audio features. The accompanying dataset contains 10,000 real-world ad videos with 28,277 annotated scene instances across 82 tags covering presentation formats, visual styles, and locations. Performance is evaluated using the mAP@[0.5:0.95] metric, which averages precision across temporal intersection-over-union thresholds from 0.5 to 0.95 to ensure rigorous assessment of second-level parsing accuracy.
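The metric above rests on temporal IoU: the overlap between a predicted scene segment and a ground-truth segment, divided by their union in time. A minimal sketch of that computation follows; this is an illustration of the evaluation idea under stated assumptions, not the official competition scorer.

```python
from typing import Tuple

def temporal_iou(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
    """Temporal IoU between two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# mAP@[0.5:0.95] averages average precision over ten tIoU thresholds
# spaced 0.05 apart, so a prediction must localize tightly to score
# well at the upper thresholds.
TIOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

def is_match(pred: Tuple[float, float], gt: Tuple[float, float],
             threshold: float) -> bool:
    """A prediction counts as a true positive at a given threshold
    only if its tIoU with a ground-truth segment reaches it."""
    return temporal_iou(pred, gt) >= threshold
```

Averaging across the threshold sweep rewards second-level precision: a prediction off by a second or two may still match at tIoU 0.5 but will fail at 0.95, pulling down the mean.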

computer vision, deep learning, multimodal learning, advertising technology, temporal segmentation
Written by

Tencent Advertising Technology

Official hub of Tencent Advertising Technology, sharing the team's latest cutting-edge achievements and advertising technology applications.
