
How Youku Tackles Multimodal Video Understanding and Quality Control

This article outlines Youku's multimodal video content understanding pipeline, covering business needs, problem decomposition, data construction, model selection, OCR subtitle extraction, scene and action recognition, sample augmentation, noise handling, and multimodal fusion strategies for robust content moderation.

Youku Technology

May 13, 2019

In a recent technical salon, Alibaba senior algorithm expert Fei Fei presented Youku's approach to multimodal video content understanding and quality control, focusing on two representative projects that illustrate the core technical ideas.

Business Context and Multimodal Definition

Youku processes a massive volume of user-generated short videos of varying quality. The platform needs to (1) assign safety grades and risk tags for content moderation and (2) generate basic metadata for distribution and production. Video data contains three modalities: image (frames and cover), text (titles, comments, OCR/ASR output), and audio.

Technical Decomposition: Problem, Data, Model

The workflow is split into three layers:

Problem: Translate business requirements into concrete technical tasks and define evaluation metrics.

Data: Assess existing datasets, construct new labeled data (often synthetically, as for subtitles), and handle noisy industrial data.

Model: Choose appropriate architectures, fuse multiple modalities, and balance accuracy with inference cost.

Project 1 – Subtitle OCR

Subtitle extraction follows a two‑step pipeline: (1) detect text regions in video frames (object detection) and (2) recognize the characters within those regions (sequence recognition). Standard subtitles are horizontal text at the bottom of the frame, while non‑standard subtitles may appear in various orientations and styles.
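The two-step structure can be sketched as below. This is a minimal illustration, not Youku's implementation: the `detect` and `recognize` callables stand in for whatever detection and recognition models are used, and the `bottom_fraction` and `min_aspect` thresholds for deciding whether a box is a "standard" subtitle are hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in pixels


@dataclass
class SubtitleLine:
    box: Box
    text: str
    standard: bool  # True if the box is wide and near the bottom of the frame


def is_standard_subtitle(box: Box, frame_height: int,
                         bottom_fraction: float = 0.3,
                         min_aspect: float = 2.0) -> bool:
    """Heuristic (assumed): a wide box whose centre lies in the bottom part of the frame."""
    x, y, w, h = box
    if h <= 0:
        return False
    center_y = y + h / 2
    in_bottom = center_y >= frame_height * (1 - bottom_fraction)
    return in_bottom and (w / h) >= min_aspect


def extract_subtitles(frame: object, frame_height: int,
                      detect: Callable[[object], List[Box]],
                      recognize: Callable[[object, Box], str]) -> List[SubtitleLine]:
    """Step 1: detect text regions. Step 2: recognize the characters in each region."""
    lines = []
    for box in detect(frame):
        text = recognize(frame, box)
        lines.append(SubtitleLine(box, text, is_standard_subtitle(box, frame_height)))
    return lines
```

Keeping the two steps behind separate callables lets the detector and the recognizer be swapped or retrained independently.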

Because no public subtitle dataset exists, Youku generated millions of synthetic samples using subtitle creation tools. Baseline models were selected from recent computer‑vision advances, then refined with anchor‑free detection and frame‑level temporal smoothing, exploiting the fact that subtitle positions are relatively fixed across frames.
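The source does not say how the frame-level temporal smoothing is done. One simple way to exploit the relatively fixed subtitle position is a median filter over the detected box coordinates of neighbouring frames, sketched here. The function name, the window size, and the majority rule for dropping or filling detections are all assumptions made for illustration.

```python
from statistics import median
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


def smooth_boxes(boxes: List[Optional[Box]], window: int = 5) -> List[Optional[Box]]:
    """Median-filter each box coordinate over a centred window of frames.

    `None` means no subtitle box was detected in that frame. A frame gets a
    smoothed box only if more than half of the frames in its window have a
    detection, so isolated false positives are dropped and isolated misses
    are filled in from their neighbours.
    """
    half = window // 2
    smoothed: List[Optional[Box]] = []
    for i in range(len(boxes)):
        window_slice = boxes[max(0, i - half): i + half + 1]
        neighbours = [b for b in window_slice if b is not None]
        if len(neighbours) * 2 <= len(window_slice):
            smoothed.append(None)
            continue
        smoothed.append(tuple(median(b[k] for b in neighbours) for k in range(4)))
    return smoothed
```

A median is used rather than a mean so that a single outlier box, for example a logo detected in one frame, does not drag the subtitle region away from its usual position.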

The customized OCR model improved accuracy by 3 to 4 percentage points over generic solutions and reached over 97% precision on high-definition subtitles.

Project 2 – Scene and Action Recognition

The goal is to locate specific scenes (e.g., singing or dancing segments) within long videos. Three practical solutions are discussed:

Frame‑level image classification on extracted frames, followed by temporal aggregation.

Video-level tagging of the whole clip, which suits short videos.

Dividing the video into shots, classifying each shot, and merging shot‑level predictions.
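As an illustration of the first option, the sketch below turns per-frame classifier scores for one target class (say, "singing") into time segments. It assumes a moving average for the temporal aggregation; the threshold, window size, and minimum segment length are hypothetical parameters, and the source does not specify which aggregation Youku uses.

```python
from typing import List, Tuple


def moving_average(scores: List[float], window: int = 3) -> List[float]:
    """Average each score with its neighbours inside a centred window."""
    half = window // 2
    out = []
    for i in range(len(scores)):
        chunk = scores[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def scores_to_segments(scores: List[float], fps: float = 1.0,
                       threshold: float = 0.5, window: int = 3,
                       min_frames: int = 2) -> List[Tuple[float, float]]:
    """Return (start_sec, end_sec) spans where the smoothed score stays above threshold."""
    smoothed = moving_average(scores, window)
    segments = []
    start = None
    for i, s in enumerate(smoothed):
        if s >= threshold and start is None:
            start = i  # a segment opens at this sampled frame
        elif s < threshold and start is not None:
            if i - start >= min_frames:
                segments.append((start / fps, i / fps))
            start = None
    # Close a segment that runs to the end of the video.
    if start is not None and len(smoothed) - start >= min_frames:
        segments.append((start / fps, len(smoothed) / fps))
    return segments
```

Here `fps` is the rate at which frames were sampled for classification, not necessarily the video's native frame rate.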

For long‑form videos, the shot‑level approach combined with audio cues yields the best performance. Model selection again balances accuracy and inference cost, with attention to class imbalance, hard‑sample mining, and noisy labels.
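A rough sketch of the shot-level approach with audio cues follows. It assumes each shot already carries a visual score and an audio score for the target scene, combines them with a fixed weight, and merges adjacent positive shots into one segment. The `Shot` fields, `audio_weight`, and `threshold` are hypothetical; the source does not describe how the audio cue is actually combined.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Shot:
    start: float         # seconds
    end: float           # seconds
    visual_score: float  # classifier score for the target scene, in [0, 1]
    audio_score: float   # audio-model score for the same scene, in [0, 1]


def merge_positive_shots(shots: List[Shot], audio_weight: float = 0.4,
                         threshold: float = 0.5,
                         max_gap: float = 0.0) -> List[Tuple[float, float]]:
    """Score each shot, keep the positive ones, and merge neighbours into segments."""
    segments: List[Tuple[float, float]] = []
    for shot in shots:
        score = (1 - audio_weight) * shot.visual_score + audio_weight * shot.audio_score
        if score < threshold:
            continue
        if segments and shot.start - segments[-1][1] <= max_gap:
            # Extend the previous segment when this shot follows it directly.
            segments[-1] = (segments[-1][0], shot.end)
        else:
            segments.append((shot.start, shot.end))
    return segments
```

In this sketch a shot whose visual score alone is below the threshold can still be kept when the audio score is high, which is one way an audio cue could help on scenes such as singing.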

Data Augmentation and Noise Handling

Sample augmentation (geometric transforms, color jitter, synthetic text overlay) mitigates data scarcity but must be applied with care to avoid over-fitting. Noise-heavy industrial data are often down-weighted rather than discarded; recent studies suggest that soft-weighting noisy samples improves model robustness.
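Down-weighting can be expressed as a per-sample weight in the training loss. The sketch below is a plain-Python weighted cross-entropy, assuming each sample comes with a trust weight in [0, 1]. How those weights are estimated is not covered in the source, so they are simply taken as inputs here.

```python
import math
from typing import List, Sequence


def weighted_cross_entropy(probs: Sequence[Sequence[float]],
                           labels: Sequence[int],
                           weights: Sequence[float]) -> float:
    """Weighted mean of -log p(label) over samples.

    probs[i] holds the predicted class probabilities for sample i,
    labels[i] is its (possibly noisy) class index, and weights[i] is how
    much that label is trusted. A weight of 0 is equivalent to discarding
    the sample; values between 0 and 1 keep it with reduced influence.
    """
    total = 0.0
    weight_sum = 0.0
    for p, y, w in zip(probs, labels, weights):
        total += w * -math.log(max(p[y], 1e-12))  # clamp to avoid log(0)
        weight_sum += w
    return total / weight_sum if weight_sum > 0 else 0.0
```

For example, with `probs = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]` and `labels = [0, 1, 1]`, giving the third, suspicious sample a weight of 0.2 instead of 1.0 lowers its contribution to the loss without removing it from training.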

Multimodal Fusion Strategies

Two main fusion paradigms are presented:

End-to-end fusion, where features from all modalities are concatenated into a single feature space and processed by a unified model. It requires large amounts of training data but offers higher raw performance.

Modality-specific models whose predictions are combined via label-level fusion. This provides better interpretability and more flexibility when business requirements change rapidly.

In practice, Youku adopts a hybrid: per‑modality models for OCR, scene, and audio, followed by a label‑level fusion that supports downstream tasks such as multimodal search and intelligent video clipping.
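Label-level fusion can be as simple as a weighted average of the scores each modality assigns to a label. The sketch below assumes every per-modality model outputs a mapping from label to score; the modality names, weights, and threshold are illustrative and not taken from Youku's system.

```python
from typing import Dict


def fuse_labels(predictions: Dict[str, Dict[str, float]],
                modality_weights: Dict[str, float],
                threshold: float = 0.5) -> Dict[str, float]:
    """Combine per-modality label scores into one score per label.

    predictions maps modality name -> {label: score}. A label's fused score
    is the weighted average over the modalities that reported it; labels
    whose fused score falls below the threshold are dropped.
    """
    labels = {label for scores in predictions.values() for label in scores}
    fused: Dict[str, float] = {}
    for label in labels:
        weighted_sum = 0.0
        weight_total = 0.0
        for modality, scores in predictions.items():
            if label in scores:
                w = modality_weights.get(modality, 1.0)  # default weight is assumed
                weighted_sum += w * scores[label]
                weight_total += w
        score = weighted_sum / weight_total if weight_total > 0 else 0.0
        if score >= threshold:
            fused[label] = score
    return fused
```

Because the weights and threshold live outside the models, they can be adjusted when a business rule changes without retraining any per-modality model, which matches the flexibility argument made for label-level fusion.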

Conclusion

Multimodal video understanding remains a challenging research frontier in computer vision, especially when applied to large‑scale industrial scenarios. Progress in problem definition, dataset construction, model accuracy, and inference efficiency continues to unlock new capabilities for video production and distribution.
