
Multimodal Video Ad Second-Level Parsing: Algorithm Design and Baseline Analysis for the 2021 Tencent Advertising Algorithm Competition

This article details the algorithmic framework and baseline models for the 2021 Tencent Advertising Algorithm Competition, focusing on multimodal video ad parsing through temporal localization, scene segmentation, and multi-label classification to enhance advertising effectiveness and creative analysis.

Tencent Advertising Technology

This article outlines the technical framework and baseline methodologies for the 2021 Tencent Advertising Algorithm Competition, focusing on the novel task of multimodal video ad second-level parsing. Video advertisements are characterized by short durations, fast pacing, and high information density, making granular semantic understanding crucial for creative analysis, automated production, and recommendation systems. The core commercial logic relies on the 4T framework: Attraction, Trust, Persuasion, and Action, which necessitates precise temporal and multimodal feature extraction.

To address this, the competition abstracts the business problem into a unified AI research task combining temporal localization, multimodal representation learning, and structured video tagging. The authors analyze two primary algorithmic routes: multi-stage pipelines and end-to-end architectures. While multi-stage methods offer higher reliability and interpretability through sequential shot boundary detection, scene aggregation, and classification, end-to-end models provide streamlined processing but face challenges with modality constraints and long-tail label distributions. Consequently, a multi-stage baseline is adopted for its robustness.
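The multi-stage route can be summarized as three sequential stages: shot boundary detection, shot-to-scene aggregation, and per-scene classification. The following is a minimal toy sketch of that control flow, not the competition baseline itself; the scalar "frame features", the difference threshold, and both function names are illustrative assumptions standing in for TransNet v2 and LGSS.

```python
from typing import List, Tuple

def detect_shots(frames: List[float], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Toy stand-in for shot boundary detection (stage 1): cut wherever
    successive frame features differ by more than `threshold`."""
    cuts = [0]
    for i in range(1, len(frames)):
        if abs(frames[i] - frames[i - 1]) > threshold:
            cuts.append(i)
    cuts.append(len(frames))
    return [(cuts[i], cuts[i + 1]) for i in range(len(cuts) - 1)]

def merge_shots_to_scenes(shots: List[Tuple[int, int]],
                          min_len: int = 2) -> List[Tuple[int, int]]:
    """Toy stand-in for local-to-global scene aggregation (stage 2):
    fold shots shorter than `min_len` frames into their predecessor."""
    scenes: List[Tuple[int, int]] = []
    for start, end in shots:
        if scenes and end - start < min_len:
            scenes[-1] = (scenes[-1][0], end)
        else:
            scenes.append((start, end))
    return scenes
```

Stage 3 (multi-label classification of each resulting scene) is where the multimodal fusion network described below plugs in; keeping the stages separate is exactly what gives the multi-stage route its interpretability, since each intermediate output can be inspected and tuned independently.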

The baseline pipeline integrates TransNet v2 for accurate shot boundary detection, LGSS for local-to-global scene segmentation, and a multimodal classification network leveraging NextVLAD, BERT, and VGGish to fuse visual, textual, and audio features. The accompanying dataset contains 10,000 real-world ad videos with 28,277 annotated scene instances across 82 tags covering presentation formats, visual styles, and locations. Performance is evaluated using the mAP@[0.5:0.95] metric, which averages precision across temporal intersection-over-union thresholds from 0.5 to 0.95 to ensure rigorous assessment of second-level parsing accuracy.
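The metric above rests on temporal IoU: the overlap between a predicted scene segment and a ground-truth segment, divided by their union in time. A minimal sketch of that computation follows; this is an illustration of the evaluation idea under stated assumptions, not the official competition scorer.

```python
from typing import Tuple

def temporal_iou(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
    """Temporal IoU between two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# mAP@[0.5:0.95] averages average precision over ten tIoU thresholds
# spaced 0.05 apart, so a prediction must localize tightly to score
# well at the upper thresholds.
TIOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

def is_match(pred: Tuple[float, float], gt: Tuple[float, float],
             threshold: float) -> bool:
    """A prediction counts as a true positive at a given threshold
    only if its tIoU with a ground-truth segment reaches it."""
    return temporal_iou(pred, gt) >= threshold
```

Averaging across the threshold sweep rewards second-level precision: a prediction off by a second or two may still match at tIoU 0.5 but will fail at 0.95, pulling down the mean.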

computer vision, deep learning, multimodal learning, advertising technology, temporal segmentation
Written by

Tencent Advertising Technology

Official hub of Tencent Advertising Technology, sharing the team's latest cutting-edge achievements and advertising technology applications.
