Multi-Modal Video Understanding and AIGC Video Generation at Autohome
This article presents a comprehensive multi-modal video understanding system developed by Autohome for AIGC (AI-Generated Content) video generation, covering the technical architecture, GCN-based semi-supervised learning, and practical applications across automotive content scenarios.
The system responds to the growing dominance of video content on platforms such as Douyin, Kuaishou, and Bilibili, and to Autohome's need to establish its own technical capabilities in this space.
The technical approach treats multi-modal video understanding as a critical component of AIGC video generation. The system combines several techniques: NeXtVLAD for efficient video classification, GRU networks with attention mechanisms for temporal modeling, and various CNN architectures for spatial-temporal feature extraction.
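To make the NeXtVLAD idea concrete, here is a minimal NumPy sketch of its aggregation step: expand the frame features, split them into low-dimensional groups, and accumulate gated soft-assignment residuals against cluster centers. All weight names, shapes, and hyperparameters below are illustrative assumptions, not Autohome's production configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nextvlad(frames, clusters, w_expand, w_assign, w_gate, groups=4, lam=2):
    """NeXtVLAD-style pooling sketch.

    frames:   (T, N)  frame-level features
    clusters: (K, D)  cluster centers, with D = lam*N // groups
    returns:  (K*D,)  video-level descriptor
    """
    T, N = frames.shape
    K, D = clusters.shape
    x = frames @ w_expand                          # (T, lam*N) dimension expansion
    gate = sigmoid(x @ w_gate)                     # (T, groups) per-group attention
    assign = softmax((x @ w_assign).reshape(T, groups, K), axis=-1)
    assign = assign * gate[:, :, None]             # gated soft cluster assignment
    xg = x.reshape(T, groups, D)                   # split into low-dim groups
    # residual aggregation: sum over t, g of assign * (x - cluster center)
    vlad = np.einsum('tgk,tgd->kd', assign, xg) \
        - assign.sum(axis=(0, 1))[:, None] * clusters
    return vlad.reshape(-1)

# toy usage with random weights
rng = np.random.default_rng(0)
T, N, G, lam, K = 30, 16, 4, 2, 8
D = lam * N // G
out = nextvlad(rng.normal(size=(T, N)), rng.normal(size=(K, D)),
               rng.normal(size=(N, lam * N)), rng.normal(size=(lam * N, G * K)),
               rng.normal(size=(lam * N, G)), groups=G, lam=lam)
```

Grouping is what makes NeXtVLAD cheaper than vanilla NetVLAD: the output dimension is `K * (lam*N/groups)` rather than `K * N`.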
The core network architecture combines a mixed multi-modal network (mix-Multmodal Network) with a Graph Convolutional Network (GCN). The multi-modal network processes three modalities (text, audio, and video), each through three stages: basic semantic understanding, temporal feature understanding, and modal fusion. NeXtVLAD handles temporal features for video and audio, while BERT processes text. Modal fusion employs a multi-group SENet structure to avoid the information loss that a single compressed bottleneck would cause.
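A minimal sketch of the multi-group SENet fusion idea follows: several squeeze-and-excitation gates run in parallel over the concatenated modality features, and their outputs are concatenated, so no single low-dimensional bottleneck has to carry all of the information. The function names, bottleneck size, and group count are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_gate(x, w1, w2):
    """One SE gate: bottleneck FC -> ReLU -> FC -> sigmoid, then rescale x."""
    z = np.maximum(x @ w1, 0.0)       # squeeze to a low-dim bottleneck
    return x * sigmoid(z @ w2)        # channel-wise excitation and rescaling

def multi_group_se_fusion(text, audio, video, params):
    """Fuse modality vectors with several parallel SE gates.

    Each group gates the concatenated features independently; concatenating
    the group outputs preserves more information than one shared bottleneck.
    """
    fused = np.concatenate([text, audio, video])
    return np.concatenate([se_gate(fused, w1, w2) for w1, w2 in params])

# toy usage: three 64-d modality vectors, four SE groups, bottleneck of 16
rng = np.random.default_rng(1)
C, r, groups = 3 * 64, 16, 4
params = [(rng.normal(size=(C, r)) * 0.1, rng.normal(size=(r, C)) * 0.1)
          for _ in range(groups)]
out = multi_group_se_fusion(rng.normal(size=64), rng.normal(size=64),
                            rng.normal(size=64), params)
```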
The GCN component addresses the semi-supervised nature of the YouTube-8M dataset, where coarse-grained labels are fully annotated but fine-grained labels are only partially labeled. The system constructs a label correlation graph using conditional probability matrices derived from label co-occurrence statistics, enabling the model to learn label dependencies and improve classification performance.
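The label correlation graph described above can be sketched directly from annotation statistics: count label co-occurrences, convert them to conditional probabilities, threshold away noisy edges, and apply the symmetric normalization a GCN layer expects. The threshold value and normalization details here are common choices, assumed rather than taken from the article.

```python
import numpy as np

def label_correlation_graph(labels, tau=0.4):
    """Build a normalized label adjacency matrix from co-occurrence statistics.

    labels: (num_videos, num_labels) binary multi-hot annotations
    A[i, j] approximates P(label j | label i), thresholded at tau.
    """
    m = labels.T @ labels                    # co-occurrence counts M_ij
    n = np.diag(m).astype(float)             # per-label occurrence counts N_i
    cond = m / np.maximum(n[:, None], 1.0)   # conditional probability P(j | i)
    adj = (cond >= tau).astype(float)        # binarize to suppress rare pairs
    np.fill_diagonal(adj, 1.0)               # keep self-loops for propagation
    # symmetric normalization D^{-1/2} A D^{-1/2}, as used in GCN layers
    d = adj.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-8))
    return adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# toy usage: 4 videos, 3 labels; labels 0 and 1 co-occur often, 0 and 2 rarely
labels = np.array([[1, 1, 0],
                   [1, 1, 0],
                   [1, 0, 1],
                   [0, 0, 1]])
A = label_correlation_graph(labels, tau=0.6)
```

A GCN stacked on this graph then propagates classifier information between correlated labels, which is what lets fully annotated coarse labels support the partially labeled fine-grained ones.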
To handle the multi-label classification challenges of YouTube-8M, the system implements weighted cross-entropy loss that assigns higher weights to annotated classes and lower weights to unannotated ones. Additionally, feature enhancement techniques include Gaussian noise injection and modal masking to prevent overfitting and ensure balanced learning across all modalities.
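The loss weighting and feature enhancement can be sketched as follows. Unannotated classes are not trusted as true negatives, so they receive a small weight in the binary cross-entropy; features get Gaussian noise, and whole modalities are occasionally zeroed out so no single modality dominates. Weight values and probabilities are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def weighted_bce(logits, targets, annotated_mask, w_hi=1.0, w_lo=0.1):
    """Per-class weighted binary cross-entropy for partially labeled data.

    annotated_mask: 1 where the class was actually annotated, 0 where the
    fine-grained label is missing and should contribute only weakly.
    """
    p = np.clip(sigmoid(logits), 1e-7, 1 - 1e-7)
    bce = -(targets * np.log(p) + (1 - targets) * np.log(1 - p))
    w = np.where(annotated_mask == 1, w_hi, w_lo)
    return (w * bce).mean()

def enhance_features(feats, rng, noise_std=0.1, mask_prob=0.15):
    """Feature enhancement: Gaussian noise plus random whole-modality masking."""
    out = {}
    for name, x in feats.items():
        x = x + rng.normal(0.0, noise_std, size=x.shape)  # noise injection
        if rng.random() < mask_prob:                      # drop an entire modality
            x = np.zeros_like(x)
        out[name] = x
    return out

# toy usage: down-weighting the two unannotated classes reduces the loss
rng = np.random.default_rng(2)
logits = np.array([2.0, -2.0, 0.0, 3.0])
targets = np.array([1.0, 0.0, 1.0, 0.0])
mask = np.array([1, 1, 0, 0])
full = weighted_bce(logits, targets, np.ones(4))
partial = weighted_bce(logits, targets, mask)
feats = enhance_features({'video': np.ones(8), 'audio': np.ones(8)}, rng)
```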
The system has been applied across multiple automotive content scenarios, including strongly car-related categories (new car reports), data-driven categories (car rankings), and weakly car-related categories (off-road content, accident compilations), demonstrating practical effectiveness in real-world AIGC video generation.
The article concludes with author information and references to related technical articles on reinforcement learning applications and recommendation system architecture at Autohome.
HomeTech tech sharing