Detecting Pornographic Videos with Dual‑Modal AI: Images + Audio
This article presents a technical overview of a multimodal AI framework that combines image and audio analysis to identify pornographic video content, detailing model architectures, feature extraction methods, and experimental results achieving 93.4% accuracy on a 3,000‑sample test set.
Background
With the rapid growth of short‑video platforms, a large volume of user‑generated videos includes pornographic material that threatens internet safety and harms minors. Detecting such content is a multimodal problem, requiring both image and audio analysis because some videos appear benign visually but contain explicit audio.
Technical Challenges
Image‑based porn detection faces two main issues. First, the pornographic region often occupies only a small portion of the frame, which makes high recall difficult. Second, low‑quality or vulgar images can look similar to legitimate content. Audio‑based detection has little established research to build on, and existing speech classification methods are not directly applicable.
Proposed Framework
The solution consists of three components:
Pornographic image recognition model
Pornographic audio recognition model
Fusion of image and audio model outputs
1. Image Recognition Model
A custom network called DCNet is introduced, featuring parallel classification and detection branches to capture both global and local features. Two key optimizations are applied to the detection branch:
Feature fusion using BiFPN, which assigns adaptive weights to different feature maps and performs bidirectional fusion, improving detection performance.
An anchor‑free design based on fully convolutional networks (FCN) with a center‑point branch, enabling finer‑grained detection of small regions and reducing false positives.
The architecture diagram (see image) illustrates the dual‑branch structure.
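The adaptive weighting that BiFPN applies to feature maps can be illustrated with a small example. The sketch below uses the "fast normalized fusion" rule from the BiFPN formulation in EfficientDet, where each input map gets a learnable scalar weight that is clipped to be non‑negative and then normalized. Whether DCNet uses this exact variant is an assumption, and the function name, toy feature maps, and weight values are hypothetical.

```python
from typing import List

FeatureMap = List[List[float]]  # a single-channel H x W map, for illustration only


def fast_normalized_fusion(
    maps: List[FeatureMap], raw_weights: List[float], eps: float = 1e-4
) -> FeatureMap:
    """Fuse same-sized feature maps with non-negative, normalized weights."""
    if len(maps) != len(raw_weights):
        raise ValueError("need exactly one weight per feature map")
    # ReLU keeps each weight non-negative so the normalization stays stable.
    weights = [max(0.0, w) for w in raw_weights]
    total = sum(weights) + eps
    height, width = len(maps[0]), len(maps[0][0])
    fused = [[0.0] * width for _ in range(height)]
    for fmap, w in zip(maps, weights):
        for y in range(height):
            for x in range(width):
                fused[y][x] += (w / total) * fmap[y][x]
    return fused


# Two toy 2x2 maps, assumed to be already resized to the same resolution.
top_down = [[1.0, 1.0], [1.0, 1.0]]
lateral = [[3.0, 3.0], [3.0, 3.0]]
# Hypothetical learned weights: the lateral map is trusted three times as much.
fused = fast_normalized_fusion([top_down, lateral], raw_weights=[1.0, 3.0])
```

In a real network the weights would be trained together with the rest of the model, and the fusion would run once per node in both the top‑down and bottom‑up passes.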
2. Audio Recognition Model
Inspired by speech classification, the audio pipeline converts raw wav files into log Mel‑spectrograms, treating each second of audio as a 2‑D image. The model, named RANet, incorporates:
Segmentation of the spectrogram into equal‑duration clips, processed by a Temporal Segment Network (TSN) to capture temporal dynamics.
A frequency‑attention module inserted at both ends of a ResNet backbone, consisting of two convolutional layers that highlight salient frequency components.
The corresponding architecture diagram is included.
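The segment‑based processing described above can be sketched as follows. The example splits a spectrogram's time axis into equal‑length clips, scores each clip independently, and averages the scores, which is a common TSN‑style consensus. The averaging rule is an assumption (the article does not state which consensus function RANet uses), and `toy_score` is a hypothetical stand‑in for the ResNet backbone.

```python
from typing import Callable, List, Sequence

Frame = Sequence[float]  # one spectrogram column: Mel-bin values at one time step


def split_into_segments(
    frames: Sequence[Frame], num_segments: int
) -> List[Sequence[Frame]]:
    """Split the time axis into num_segments clips of equal length.

    Trailing frames that do not fill a whole clip are dropped.
    """
    if num_segments <= 0:
        raise ValueError("num_segments must be positive")
    clip_len = len(frames) // num_segments
    if clip_len == 0:
        raise ValueError("not enough frames for the requested number of segments")
    return [frames[i * clip_len:(i + 1) * clip_len] for i in range(num_segments)]


def segment_consensus(
    frames: Sequence[Frame],
    num_segments: int,
    score_clip: Callable[[Sequence[Frame]], float],
) -> float:
    """Score every clip independently, then average the per-clip scores."""
    clips = split_into_segments(frames, num_segments)
    scores = [score_clip(clip) for clip in clips]
    return sum(scores) / len(scores)


def toy_score(clip: Sequence[Frame]) -> float:
    """Hypothetical scorer: mean value of the clip, clamped to [0, 1]."""
    total = sum(sum(frame) for frame in clip)
    count = sum(len(frame) for frame in clip)
    return min(1.0, max(0.0, total / count))


# Toy input: 6 time steps x 2 Mel bins, split into 3 clips of 2 steps each.
spectrogram = [[0.0, 0.0], [0.0, 0.0], [0.5, 0.5], [0.5, 0.5], [1.0, 1.0], [1.0, 1.0]]
video_score = segment_consensus(spectrogram, num_segments=3, score_clip=toy_score)
```

Averaging lets a short explicit passage raise the overall score without requiring the whole clip to be explicit. Other consensus functions, such as taking the maximum, would weight short passages even more heavily.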
3. Fusion Strategy
Outputs from the image and audio models are combined to produce a final classification, leveraging complementary cues from both modalities.
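The article does not describe how the two outputs are combined, so the following is only one plausible late‑fusion rule, not the authors' method. It assumes each model emits a probability in [0, 1] and flags a video when either modality is confident, which matches the motivating case of videos that look benign but contain explicit audio. The function name and the 0.5 threshold are hypothetical.

```python
def fuse_scores(image_score: float, audio_score: float, threshold: float = 0.5) -> bool:
    """Flag a video if either modality's score reaches the threshold.

    Assumption: both scores are probabilities in [0, 1]. Taking the maximum
    is an illustrative choice; a weighted average or a small learned
    classifier over both scores would be equally reasonable alternatives.
    """
    for name, score in (("image_score", image_score), ("audio_score", audio_score)):
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {score}")
    return max(image_score, audio_score) >= threshold


# A visually benign video with explicit audio is still flagged.
flagged = fuse_scores(image_score=0.1, audio_score=0.9)
# A video that scores low on both modalities is not.
clean = fuse_scores(image_score=0.1, audio_score=0.2)
```

A max rule favors recall over precision; in practice the threshold would need to be tuned on validation data.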
Experimental Results
On a proprietary test set of 3,000 video samples, the integrated system achieved an accuracy of 93.4%. The result chart (see image) supports the effectiveness of the multimodal approach.
Reference
The content is translated from a 2021 paper published in the open‑access journal Applied Sciences (doi: 10.3390/app11073066). Original article: https://www.mdpi.com/2076-3417/11/7/3066
