How Advanced Video AI Transforms Content Moderation and Retrieval

This article explores how modern video AI techniques—ranging from transformer‑based classification to semi‑supervised retrieval and token‑halting acceleration—enable efficient, accurate detection of prohibited content and fast, scalable video search in the era of short‑form media.

NetEase Smart Enterprise Tech+

Introduction

With the rise of short-video platforms, the volume of online video content has exploded, bringing with it a surge of low-quality, violent, or otherwise prohibited material that violates platform policies and endangers viewers. Video analysis technology has therefore become a crucial tool for content risk control.

Because video is information‑dense, intelligent and automated analysis is essential. Advances in video analysis improve machines' comprehensive judgment and understanding, reducing reliance on manual review, dramatically increasing efficiency and accuracy, and moving content moderation to the next level.

Why Frame‑Based Image Review Is Insufficient

Simple frame‑extraction and image‑based review can identify static information but fails to capture dynamic violations such as certain dance moves or fighting actions that only become apparent over time.

Advantages of Video‑Level Processing

Temporal Correlation: Captures time‑related changes between consecutive frames.

Contextual Information: Provides richer spatio‑temporal context, including object motion and scene changes.

Action Recognition: Enables motion tracking and accurate detection of dynamic behaviors.

Event Detection: Analyzes entire video segments to infer complex events.

Video Tag Classification Capability

We compared image‑classification and video‑classification solutions. Video classification outperforms static frame methods by capturing dynamic and sequential features.

Technical Solution (Video Classification)

Video classification has evolved from 2D‑CNN+Temporal models to 3D convolutions and now Transformer‑based approaches. Traditional CNN‑RNN or 3D‑CNN models struggle with long‑range dependencies, whereas Transformer self‑attention captures long‑distance relations effectively.

We adopted the TimeSformer model, treating video frames as a sequence and applying multi‑head self‑attention to learn spatio‑temporal features.
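The divided space-time attention at the heart of TimeSformer can be sketched in a few lines of NumPy. This is a deliberately simplified single-head version with no learned projections or residual connections, not the full model: each token attends first along the time axis (same patch position across frames), then along the spatial axis (patches within its frame).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over the second-to-last axis.
    scores = q @ k.swapaxes(-2, -1) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def divided_space_time_attention(x):
    """x: (T, N, D) -- T frames, N patch tokens per frame, D channels.

    Divided attention: each token attends over time (length-T sequences,
    one per patch position), then over space (length-N sequences, one
    per frame), instead of over all T*N tokens jointly.
    """
    T, N, D = x.shape
    xt = x.transpose(1, 0, 2)                      # (N, T, D): time axis last-but-one
    xt = attention(xt, xt, xt).transpose(1, 0, 2)  # temporal attention, back to (T, N, D)
    return attention(xt, xt, xt)                   # spatial attention within each frame

rng = np.random.default_rng(0)
video_tokens = rng.normal(size=(8, 16, 32))  # 8 frames, 16 patches, 32-dim
out = divided_space_time_attention(video_tokens)
print(out.shape)  # (8, 16, 32)
```

The output keeps the input layout, so such a block can be stacked layer after layer, as in the real model.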

In addition to supervised learning, we explored semi‑supervised training using temporal contrastive learning: varying playback speed creates positive pairs, while different videos form negative pairs. We also employed Group Contrastive Loss to pull frames from the same video together and push frames from different videos apart.
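The playback-speed positive-pair idea can be illustrated with a minimal InfoNCE-style contrastive loss in NumPy. This is a common stand-in for the temporal contrastive objective, not the paper's exact formulation; the batch size, feature dimension, and noise model below are illustrative.

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE-style contrastive loss: anchors[i] and positives[i] are
    embeddings of the same video at two playback speeds; every other
    row in the batch acts as a negative for row i."""
    a, p = l2norm(anchors), l2norm(positives)
    logits = a @ p.T / temperature           # (B, B) similarity matrix
    # Log-softmax over each row; the diagonal holds the positive pairs.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(a))
    return -log_prob[idx, idx].mean()

rng = np.random.default_rng(0)
base = rng.normal(size=(4, 16))                    # 4 videos, 16-dim clip features
slow = base + 0.05 * rng.normal(size=base.shape)   # same videos at another speed
loss_matched = info_nce(base, slow)
loss_shuffled = info_nce(base, np.roll(slow, 1, axis=0))  # mismatched pairs
print(loss_matched, loss_shuffled)
```

When pairs are correctly matched the loss is near zero; shuffling the pairing drives it up, which is exactly the signal that lets unlabeled videos train the encoder.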

Implementation Effects (Classification)

Using Divided Space‑Time Attention reduces complexity from O(NM) to O(N+M). On a low‑quality dance detection task, inference time is ~80 ms with 85.7 % accuracy, and semi‑supervised methods leverage unlabeled data effectively.
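The complexity saving is easy to sanity-check with a quick calculation. The patch and frame counts below are typical ViT-style values, not the production configuration:

```python
# Attention comparisons per token, for N spatial patches per frame and
# M frames: joint space-time attention compares each token against all
# N*M tokens; divided attention compares against M tokens (time step)
# plus N tokens (space step).
N, M = 196, 8          # e.g. 14x14 patches, 8 sampled frames
joint = N * M          # O(N*M) per token
divided = N + M        # O(N+M) per token
print(joint, divided, joint / divided)  # 1568 204 ~7.7x fewer comparisons
```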

Video Feature Retrieval Capability

Beyond classification, video retrieval finds similar video segments. Frame‑level retrieval is accurate but slow; video‑level retrieval is fast but less precise. Our 3D‑CSL architecture uses a TimeSformer backbone to extract clip‑level 3D features, combining the strengths of both approaches.

We introduced a self‑supervised 3D context similarity learning strategy with positive pairs generated by random augmentations (cropping, clipping, padding) and negative pairs from other videos, optimized with Multi‑Similarity Loss. Additionally, a Flip‑Consistency (FCS) loss encourages the model to distinguish horizontally flipped clips.
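A minimal NumPy sketch of Multi-Similarity Loss is shown below. The hyper-parameters alpha, beta, and lam are illustrative defaults from the original Multi-Similarity paper's family of settings, not the values used in our training:

```python
import numpy as np

def multi_similarity_loss(sim, labels, alpha=2.0, beta=50.0, lam=0.5):
    """Multi-Similarity loss on a cosine-similarity matrix `sim`.
    labels[i] identifies the source video of clip i; clips sharing a
    label are positives, all others negatives. Self-similarity on the
    diagonal is excluded."""
    n = len(sim)
    off_diag = ~np.eye(n, dtype=bool)
    pos_mask = (labels[:, None] == labels[None, :]) & off_diag
    neg_mask = labels[:, None] != labels[None, :]
    losses = []
    for i in range(n):
        pos, neg = sim[i][pos_mask[i]], sim[i][neg_mask[i]]
        # Soft penalties: positives below lam and negatives above lam dominate.
        pos_term = np.log1p(np.exp(-alpha * (pos - lam)).sum()) / alpha
        neg_term = np.log1p(np.exp(beta * (neg - lam)).sum()) / beta
        losses.append(pos_term + neg_term)
    return float(np.mean(losses))

labels = np.array([0, 0, 1, 1])  # clips 0-1 from video A, 2-3 from video B
good = np.where(labels[:, None] == labels[None, :], 0.9, 0.1)  # well separated
bad = np.where(labels[:, None] == labels[None, :], 0.2, 0.8)   # inverted
loss_good = multi_similarity_loss(good, labels)
loss_bad = multi_similarity_loss(bad, labels)
print(loss_good, loss_bad)
```

The loss is low when same-video clips are close and cross-video clips are far, which is the behavior the self-supervised augmentation pairs are meant to induce.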

Implementation Effects (Retrieval)

Our method achieves state-of-the-art clip-level retrieval on FIVR-200K and CC_WEB_VIDEO, improving precision by 31 % over fast video-level methods while reducing computational cost by 64× and storage by 8×. A two-stage multi-granularity retrieval pipeline first filters candidates with video-level features, then refines the shortlist with clip-level similarity, reducing average query time from seconds to milliseconds.
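The two-stage idea can be sketched as follows. Function and variable names are hypothetical, all features are assumed L2-normalized (so dot product equals cosine similarity), and the tiny database is synthetic:

```python
import numpy as np

def two_stage_retrieval(query_clip_feats, db_video_feats, db_clip_feats, top_k=2):
    """Coarse-to-fine retrieval sketch: stage 1 ranks whole videos with
    one video-level vector each; stage 2 re-ranks only the shortlist by
    max clip-to-clip similarity."""
    # Stage 1: cheap video-level filter.
    q_video = query_clip_feats.mean(axis=0)
    q_video /= np.linalg.norm(q_video)
    coarse = db_video_feats @ q_video            # (num_videos,)
    shortlist = np.argsort(-coarse)[:top_k]
    # Stage 2: precise clip-level re-ranking on the shortlist only.
    scores = {}
    for vid in shortlist:
        sim = query_clip_feats @ db_clip_feats[vid].T  # (Q_clips, D_clips)
        scores[int(vid)] = float(sim.max())
    return sorted(scores, key=scores.get, reverse=True)

def norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(1)
db_clip_feats = [norm(rng.normal(size=(5, 8))) for _ in range(3)]  # 3 videos, 5 clips each
db_video_feats = norm(np.stack([c.mean(axis=0) for c in db_clip_feats]))
query = norm(db_clip_feats[1] + 0.01 * rng.normal(size=(5, 8)))    # near-duplicate of video 1
ranking = two_stage_retrieval(query, db_video_feats, db_clip_feats)
print(ranking)  # video 1 ranks first
```

Only the shortlist ever touches the expensive clip-level comparison, which is where the seconds-to-milliseconds speedup comes from at database scale.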

Video Algorithm Acceleration Capability

To address high computational cost, we designed HaltingVT, a dynamic video Transformer with token‑halting. Tokens are adaptively dropped during inference without a separate decision network, preserving spatio‑temporal information while cutting FLOPs.

HaltingVT incorporates a Glimpse sub‑network and Motion Loss to accelerate convergence. On Mini‑Kinetics, HaltingVT reaches 75.0 % top‑1 accuracy with 24.2 GFLOPs and 67.9 % accuracy with only 9.9 GFLOPs, outperforming comparable models.
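Reduced to its simplest form, token halting is a score-based keep/drop step. The sketch below is illustrative, not HaltingVT's exact mechanism (there the scores are learned and halting happens adaptively across layers):

```python
import numpy as np

def halting_prune(tokens, halting_scores, keep_ratio=0.5):
    """Keep the highest-scoring fraction of tokens so later Transformer
    layers process fewer of them; attention FLOPs fall roughly with the
    square of the token count."""
    n_keep = max(1, int(len(tokens) * keep_ratio))
    keep_idx = np.argsort(-halting_scores)[:n_keep]
    return tokens[np.sort(keep_idx)]  # preserve original token order

rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 64))   # 14x14 patch tokens, 64-dim
scores = rng.uniform(size=196)        # stand-in for learned halting scores
pruned = halting_prune(tokens, scores, keep_ratio=0.25)
print(pruned.shape)  # (49, 64)
```

Keeping a quarter of the tokens cuts per-layer attention cost by roughly 16× in the layers that follow, which is how the model trades a few accuracy points for a large FLOPs reduction.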

Outlook

Emerging AIGC, multimodal alignment, and large‑model techniques promise further improvements in video analysis, such as cross‑modal cues, multimodal feature fusion, and AI‑generated content enhancement for denoising, restoration, and super‑resolution.

References

"Is Space-Time Attention All You Need for Video Understanding?" ICML 2021.

"Semi-Supervised Action Recognition with Temporal Contrastive Learning." CVPR 2021.

"ViSiL: Fine-Grained Spatio-Temporal Video Similarity Learning." ICCV 2019.

"Learn from Unlabeled Videos for Near-Duplicate Video Retrieval." SIGIR 2022.

"3D-CSL: Self-Supervised 3D Context Similarity Learning for Near-Duplicate Video Retrieval." ICIP 2023.

"OCSampler: Compressing Videos to One Clip with Single-Step Sampling." CVPR 2022.

"D-STEP: Dynamic Spatio-Temporal Pruning." BMVC 2022.

"HaltingVT: Adaptive Token Halting Transformer for Efficient Video Recognition." ICASSP 2024.
