Multi-Modal Video Understanding and AIGC Video Generation at Autohome
This article presents a comprehensive multi-modal video understanding system developed by Autohome for AIGC (AI-Generated Content) video generation, covering the technical architecture, GCN-based semi-supervised learning, and practical applications across automotive content scenarios.
The system responds to the growing dominance of video content on platforms such as Douyin, Kuaishou, and Bilibili, and to Autohome's need to establish its own technical capabilities in this space.
The technical approach treats multi-modal video understanding as a critical component of AIGC video generation. The system combines several techniques: NeXtVLAD for efficient video classification, GRU networks with attention mechanisms for temporal modeling, and various CNN architectures for spatial-temporal feature extraction.
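To make the NeXtVLAD idea concrete, here is a minimal NumPy sketch of its aggregation step: expand the frame features, split them into low-dimensional groups, and accumulate gated soft-assignment residuals against cluster centers. All weight names, shapes, and hyperparameters below are illustrative assumptions, not Autohome's production configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nextvlad(frames, clusters, w_expand, w_assign, w_gate, groups=4, lam=2):
    """NeXtVLAD-style pooling sketch.

    frames:   (T, N)  frame-level features
    clusters: (K, D)  cluster centers, with D = lam*N // groups
    returns:  (K*D,)  video-level descriptor
    """
    T, N = frames.shape
    K, D = clusters.shape
    x = frames @ w_expand                          # (T, lam*N) dimension expansion
    gate = sigmoid(x @ w_gate)                     # (T, groups) per-group attention
    assign = softmax((x @ w_assign).reshape(T, groups, K), axis=-1)
    assign = assign * gate[:, :, None]             # gated soft cluster assignment
    xg = x.reshape(T, groups, D)                   # split into low-dim groups
    # residual aggregation: sum over t, g of assign * (x - cluster center)
    vlad = np.einsum('tgk,tgd->kd', assign, xg) \
        - assign.sum(axis=(0, 1))[:, None] * clusters
    return vlad.reshape(-1)

# toy usage with random weights
rng = np.random.default_rng(0)
T, N, G, lam, K = 30, 16, 4, 2, 8
D = lam * N // G
out = nextvlad(rng.normal(size=(T, N)), rng.normal(size=(K, D)),
               rng.normal(size=(N, lam * N)), rng.normal(size=(lam * N, G * K)),
               rng.normal(size=(lam * N, G)), groups=G, lam=lam)
```

Grouping is what makes NeXtVLAD cheaper than vanilla NetVLAD: the output dimension is `K * (lam*N/groups)` rather than `K * N`.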
The core network architecture combines a mixed multi-modal network (mix-Multmodal Network) with a Graph Convolutional Network (GCN). The multi-modal network processes three modalities (text, audio, and video), each through three stages: basic semantic understanding, temporal feature understanding, and modal fusion. NeXtVLAD handles temporal features for video and audio, while BERT processes text. Modal fusion employs a multi-group SENet structure to avoid the information loss that a single compressed bottleneck would cause.
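A minimal sketch of the multi-group SENet fusion idea follows: several squeeze-and-excitation gates run in parallel over the concatenated modality features, and their outputs are concatenated, so no single low-dimensional bottleneck has to carry all of the information. The function names, bottleneck size, and group count are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_gate(x, w1, w2):
    """One SE gate: bottleneck FC -> ReLU -> FC -> sigmoid, then rescale x."""
    z = np.maximum(x @ w1, 0.0)       # squeeze to a low-dim bottleneck
    return x * sigmoid(z @ w2)        # channel-wise excitation and rescaling

def multi_group_se_fusion(text, audio, video, params):
    """Fuse modality vectors with several parallel SE gates.

    Each group gates the concatenated features independently; concatenating
    the group outputs preserves more information than one shared bottleneck.
    """
    fused = np.concatenate([text, audio, video])
    return np.concatenate([se_gate(fused, w1, w2) for w1, w2 in params])

# toy usage: three 64-d modality vectors, four SE groups, bottleneck of 16
rng = np.random.default_rng(1)
C, r, groups = 3 * 64, 16, 4
params = [(rng.normal(size=(C, r)) * 0.1, rng.normal(size=(r, C)) * 0.1)
          for _ in range(groups)]
out = multi_group_se_fusion(rng.normal(size=64), rng.normal(size=64),
                            rng.normal(size=64), params)
```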
The GCN component addresses the semi-supervised nature of the YouTube-8M dataset, where coarse-grained labels are fully annotated but fine-grained labels are only partially labeled. The system constructs a label correlation graph using conditional probability matrices derived from label co-occurrence statistics, enabling the model to learn label dependencies and improve classification performance.
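The label correlation graph described above can be sketched directly from annotation statistics: count label co-occurrences, convert them to conditional probabilities, threshold away noisy edges, and apply the symmetric normalization a GCN layer expects. The threshold value and normalization details here are common choices, assumed rather than taken from the article.

```python
import numpy as np

def label_correlation_graph(labels, tau=0.4):
    """Build a normalized label adjacency matrix from co-occurrence statistics.

    labels: (num_videos, num_labels) binary multi-hot annotations
    A[i, j] approximates P(label j | label i), thresholded at tau.
    """
    m = labels.T @ labels                    # co-occurrence counts M_ij
    n = np.diag(m).astype(float)             # per-label occurrence counts N_i
    cond = m / np.maximum(n[:, None], 1.0)   # conditional probability P(j | i)
    adj = (cond >= tau).astype(float)        # binarize to suppress rare pairs
    np.fill_diagonal(adj, 1.0)               # keep self-loops for propagation
    # symmetric normalization D^{-1/2} A D^{-1/2}, as used in GCN layers
    d = adj.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-8))
    return adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# toy usage: 4 videos, 3 labels; labels 0 and 1 co-occur often, 0 and 2 rarely
labels = np.array([[1, 1, 0],
                   [1, 1, 0],
                   [1, 0, 1],
                   [0, 0, 1]])
A = label_correlation_graph(labels, tau=0.6)
```

A GCN stacked on this graph then propagates classifier information between correlated labels, which is what lets fully annotated coarse labels support the partially labeled fine-grained ones.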
To handle the multi-label classification challenges of YouTube-8M, the system implements weighted cross-entropy loss that assigns higher weights to annotated classes and lower weights to unannotated ones. Additionally, feature enhancement techniques include Gaussian noise injection and modal masking to prevent overfitting and ensure balanced learning across all modalities.
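The loss weighting and feature enhancement can be sketched as follows. Unannotated classes are not trusted as true negatives, so they receive a small weight in the binary cross-entropy; features get Gaussian noise, and whole modalities are occasionally zeroed out so no single modality dominates. Weight values and probabilities are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def weighted_bce(logits, targets, annotated_mask, w_hi=1.0, w_lo=0.1):
    """Per-class weighted binary cross-entropy for partially labeled data.

    annotated_mask: 1 where the class was actually annotated, 0 where the
    fine-grained label is missing and should contribute only weakly.
    """
    p = np.clip(sigmoid(logits), 1e-7, 1 - 1e-7)
    bce = -(targets * np.log(p) + (1 - targets) * np.log(1 - p))
    w = np.where(annotated_mask == 1, w_hi, w_lo)
    return (w * bce).mean()

def enhance_features(feats, rng, noise_std=0.1, mask_prob=0.15):
    """Feature enhancement: Gaussian noise plus random whole-modality masking."""
    out = {}
    for name, x in feats.items():
        x = x + rng.normal(0.0, noise_std, size=x.shape)  # noise injection
        if rng.random() < mask_prob:                      # drop an entire modality
            x = np.zeros_like(x)
        out[name] = x
    return out

# toy usage: down-weighting the two unannotated classes reduces the loss
rng = np.random.default_rng(2)
logits = np.array([2.0, -2.0, 0.0, 3.0])
targets = np.array([1.0, 0.0, 1.0, 0.0])
mask = np.array([1, 1, 0, 0])
full = weighted_bce(logits, targets, np.ones(4))
partial = weighted_bce(logits, targets, mask)
feats = enhance_features({'video': np.ones(8), 'audio': np.ones(8)}, rng)
```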
The system has been applied across multiple automotive content scenarios, including strongly car-related categories (new car reports), data-driven categories (car rankings), and weakly car-related categories (off-road content, accident compilations), demonstrating practical effectiveness in real-world AIGC video generation.
The article concludes with author information and references to related technical articles on reinforcement learning applications and recommendation system architecture at Autohome.
HomeTech tech sharing