How Multimodal AI Detects Pornographic Videos: Image & Audio Fusion Explained
This article outlines a multimodal AI framework that detects pornographic video content by combining image and audio analysis. It covers the challenges of visual and audio‑based recognition, the DCNet and RANet model architectures, and the fusion of their outputs, and it reports an accuracy of 93.4% on a test set of 3,000 videos.
Background
With the rise of the mobile internet, short videos have become a primary entertainment medium. Many of them contain pornographic content, which harms young viewers and threatens public safety.
Technical Challenges
Porn video detection is a multimodal problem involving image and audio recognition. Image detection faces issues such as small pornographic regions and visual similarity between vulgar and pornographic images. Audio‑based detection lacks established theory.
Detection Techniques
Traditional image methods rely on handcrafted features such as color histograms, which cannot distinguish vulgar images from pornographic ones. Deep learning approaches improve on handcrafted features, though the models used so far have limited complexity.
Audio classification typically converts WAV audio into spectrograms and applies 2‑D convolutions. More recent work uses log Mel‑spectrograms, as does the RANet model proposed here.
Proposed Framework
The framework combines image and audio modalities. It consists of three parts: a porn image recognition model, a porn audio recognition model, and a fusion of their results.
Porn Image Recognition Model
We propose DCNet, which includes classification and detection branches to capture global and local features. The detection branch uses BiFPN for weighted bidirectional feature fusion and an anchor‑free design with FCN‑style dense detection and a center‑point branch to reduce false positives.
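Two ideas in the detection branch can be illustrated with a short sketch. The first function shows the fast normalized fusion that BiFPN uses to weight its input feature maps. The second shows an FCOS-style center-ness score, on the assumption that the center-point branch works along similar lines, which the source does not confirm. The function names and the NumPy formulation are illustrative only and are not taken from DCNet.

```python
import numpy as np

def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-shaped feature maps with non-negative, normalized weights.

    In a real network `weights` would be learnable scalars; here they are
    plain numbers (an assumption made for illustration).
    """
    w = np.maximum(np.asarray(weights, dtype=np.float64), 0.0)  # ReLU keeps weights >= 0
    w = w / (w.sum() + eps)                                     # normalize without softmax
    stacked = np.stack(features, axis=0)                        # (n_inputs, H, W)
    return np.tensordot(w, stacked, axes=1)                     # weighted sum -> (H, W)

def centerness(left, top, right, bottom):
    """FCOS-style center-ness for a location inside a box.

    Arguments are the (positive) distances from the location to the four
    box sides. The score is 1.0 at the box center and falls toward 0 near
    the edges, so off-center predictions can be down-weighted.
    """
    lr = min(left, right) / max(left, right)
    tb = min(top, bottom) / max(top, bottom)
    return float(np.sqrt(lr * tb))

# Example: fuse two 2x2 feature maps with equal weights.
low_level = np.ones((2, 2))
high_level = 3.0 * np.ones((2, 2))
fused = fast_normalized_fusion([low_level, high_level], [1.0, 1.0])
```

Multiplying a classification score by the center-ness value is one way a center branch can suppress low-quality boxes predicted far from an object's center.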
Porn Audio Recognition Model
Audio is transformed into log Mel‑spectrograms (one per second) and processed with a TSN‑based architecture to capture temporal information. A frequency‑attention module, consisting of two convolutional layers inserted into a ResNet, extracts key sound features.
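The per-second log Mel-spectrogram step can be sketched with NumPy alone. The sample rate (16 kHz), window length (400 samples), hop size (160 samples), and number of Mel bands (64) are assumptions chosen for illustration; the source does not state the values used. The frequency-attention module and the TSN-based network are not reproduced here.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular Mel filters with shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_edges = mel_to_hz(mel_edges)
    fb = np.zeros((n_mels, fft_freqs.size))
    for m in range(n_mels):
        lo, mid, hi = hz_edges[m], hz_edges[m + 1], hz_edges[m + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def log_mel_per_second(wave, sr=16000, n_fft=400, hop=160, n_mels=64):
    """Return a list with one log Mel-spectrogram per full second of audio.

    Each spectrogram has shape (n_mels, n_frames). A trailing partial
    second is dropped (a simplifying assumption).
    """
    fb = mel_filterbank(sr, n_fft, n_mels)
    window = np.hanning(n_fft)
    n_frames = 1 + (sr - n_fft) // hop
    specs = []
    for start in range(0, len(wave) - sr + 1, sr):
        chunk = wave[start:start + sr]
        frames = np.stack(
            [chunk[i * hop:i * hop + n_fft] * window for i in range(n_frames)]
        )
        power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (n_frames, n_fft // 2 + 1)
        mel = fb @ power.T                                # (n_mels, n_frames)
        specs.append(np.log(mel + 1e-6))                  # small offset avoids log(0)
    return specs

# Example: 2.5 seconds of a 440 Hz tone yields two one-second spectrograms.
sr = 16000
t = np.arange(int(2.5 * sr)) / sr
specs = log_mel_per_second(np.sin(2 * np.pi * 440.0 * t), sr=sr)
```

Each element of `specs` can then be treated like a single-channel image and passed to a 2-D convolutional network, with the sequence of seconds providing the temporal segments.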
Fusion of Image and Audio Results
The outputs of the two models are merged to produce the final porn video classification.
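The source does not say how the two outputs are merged. As one possible reading, the sketch below applies a simple late fusion: a weighted average of per-modality probabilities followed by a threshold. The function name `fuse_scores`, the weight of 0.6, and the threshold of 0.5 are hypothetical and are not taken from the paper.

```python
def fuse_scores(image_score, audio_score=None, image_weight=0.6, threshold=0.5):
    """Hypothetical late fusion of image and audio porn probabilities.

    Returns (fused_score, is_porn). If the video has no usable audio
    track, the image score is used on its own.
    """
    if not 0.0 <= image_weight <= 1.0:
        raise ValueError("image_weight must be in [0, 1]")
    if audio_score is None:
        fused = image_score  # silent video: fall back to the image model
    else:
        fused = image_weight * image_score + (1.0 - image_weight) * audio_score
    return fused, fused >= threshold

# Example: a strong image score outweighs a weak audio score.
score, flagged = fuse_scores(0.9, 0.2)
```

Other merging rules, such as taking the maximum of the two scores or training a small classifier on both outputs, would fit the same interface.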
Experimental Results
On a test set of 3,000 videos, the combined model achieved an accuracy of 93.4%.
The content is a translation of a 2021 paper published in the open‑access journal Applied Sciences.
21CTO
