Multimedia Content Understanding in Meitu Community: Video Classification, Fingerprinting, and OCR
This article presents Meitu Community's AI‑driven multimedia content analysis pipeline, covering short‑video classification, video fingerprinting, and OCR, detailing model choices, experimental results, and future directions for improving content audit, quality, tagging, and feature engineering.
In the mobile‑internet era, images and short videos have exploded in volume, making computer‑vision algorithms the foundation for multimedia content analysis. Meitu Community applies image and video classification, deduplication, and quality assessment across recommendation, search, and manual‑review scenarios.
Multimedia content understanding is divided into four directions—content audit, quality assessment, tagging, and feature engineering—while the highly diverse and unevenly distributed data pose significant challenges.
Short‑video classification is used across Meitu products for recommendation, search, and moderation. Short videos are highly varied, with imbalanced categories, often shot from a single angle, and span visual, audio, and textual modalities. Model selection covers a multimodal approach (NeXtVLAD augmented with text features, EfficientNet‑B3, and a multi‑task loss) and single‑modal models (TSM, GSM). Experiments show a modest multimodal gain (≈3%), and that GSM with an added 128‑dimensional fully‑connected layer and optimized frame sampling improves accuracy by about 2% while maintaining efficiency.
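The article does not spell out the "optimized frame sampling" used with GSM; a common strategy for this class of models is TSN‑style segment sampling, where the video is split into equal segments and one frame is drawn per segment. The sketch below is an assumption of that scheme, not the exact Meitu implementation; the function name and parameters are mine.

```python
import random

def segment_sample(num_frames, num_segments, training=False, seed=None):
    """TSN-style segment sampling: split the video into num_segments equal
    spans and take one frame index per span -- a random frame within the
    span during training, the span center at inference."""
    rng = random.Random(seed)
    seg_len = num_frames / num_segments
    indices = []
    for s in range(num_segments):
        start = int(s * seg_len)
        end = max(start, int((s + 1) * seg_len) - 1)
        if training:
            indices.append(rng.randint(start, end))  # jitter for augmentation
        else:
            indices.append((start + end) // 2)       # deterministic center
    return indices

# Example: a 100-frame clip sampled at 4 segments for inference
print(segment_sample(100, 4))  # [12, 37, 62, 87]
```

Sampling a fixed, evenly spread set of frames keeps inference cost constant regardless of clip length, which matters when the classifier runs online over the whole community feed.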
Video fingerprinting faces challenges such as added intros/outros, watermarks, resolution changes, and borders. The pipeline consists of a coarse recall using video‑level features followed by frame‑level comparison with the Smith‑Waterman algorithm. Feature extraction leverages a MoCo pre‑training task with data augmentations and center weighting via R‑MAC and attention mechanisms, achieving higher accuracy and recall than hash‑based methods.
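The frame‑level comparison step can be illustrated with a minimal Smith‑Waterman local alignment over a frame‑similarity matrix. This is a generic sketch, not Meitu's production code: the thresholds, gap penalty, and the idea of binarizing similarities into match/mismatch scores are my assumptions.

```python
def smith_waterman(sim, match_th=0.8, gap=0.5):
    """Smith-Waterman local alignment over a frame-similarity matrix.
    sim[i][j] is the similarity between frame i of the query video and
    frame j of the candidate. Cells at or above match_th contribute +1,
    others -1, and insertions/deletions (e.g. added intros or dropped
    frames) pay a gap penalty. Returns the best local alignment score,
    i.e. the strength of the longest well-matching frame run."""
    n, m = len(sim), len(sim[0])
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = 1.0 if sim[i - 1][j - 1] >= match_th else -1.0
            H[i][j] = max(0.0,               # restart: local alignment
                          H[i - 1][j - 1] + s,  # match / mismatch
                          H[i - 1][j] - gap,    # gap in candidate
                          H[i][j - 1] - gap)    # gap in query
            best = max(best, H[i][j])
    return best
```

Because the alignment is local, a copied segment still scores highly even when the candidate video has an added intro, outro, or border, which is exactly the robustness the fingerprinting pipeline needs after the coarse video‑level recall.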
OCR addresses business needs for extracting text from images and videos to generate titles, product descriptions, detect ads, and enforce safety policies. Difficulties include a large character set, varied orientations, and complex backgrounds. The solution trains a PSENet detector on synthetic data and a ResNet‑based recognizer with CTC, incorporating angle prediction and correction; online results show a 32.54 % boost in detection precision and a 10 % increase in recognition accuracy over previous models.
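At inference time, a CTC‑trained recognizer is typically decoded greedily: take the argmax label at each time step, collapse consecutive repeats, then drop the blank symbol. The sketch below shows that standard decoding rule under the assumption of per‑frame class scores; it is illustrative, not the article's specific decoder.

```python
def ctc_greedy_decode(logits, blank=0):
    """Greedy CTC decoding for a ResNet+CTC recognizer.
    logits: a list of per-time-step score vectors (one entry per class,
    with index `blank` reserved for the CTC blank). Takes the best class
    per step, collapses consecutive duplicates, and removes blanks."""
    path = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    labels = []
    prev = None
    for p in path:
        if p != prev and p != blank:  # keep first of each run, skip blanks
            labels.append(p)
        prev = p
    return labels

# Per-step scores whose argmax path is [1, 1, 0, 1, 2, 2]:
# repeats collapse, the blank (0) separates the two 1s, giving [1, 1, 2].
scores = [[0.1, 0.9, 0.0], [0.1, 0.8, 0.1], [0.9, 0.05, 0.05],
          [0.2, 0.7, 0.1], [0.1, 0.2, 0.7], [0.0, 0.3, 0.7]]
print(ctc_greedy_decode(scores))  # [1, 1, 2]
```

The blank symbol is what lets CTC emit the same character twice in a row, which matters for a large Chinese character set where doubled characters are common in titles and product text.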
Conclusions highlight that multimodal fusion improves performance by 2‑3 %, strong pre‑training models are essential, and algorithms must be tightly coupled with business scenarios; future work aims to refine video tags, deepen multimodal research, and develop user‑facing features such as smart trimming and intelligent cover generation.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.