Detecting Pornographic Videos with Dual‑Modal AI: Images + Audio

This article gives a technical overview of a multimodal AI framework that combines image and audio analysis to identify pornographic video content. It covers the model architectures and feature extraction methods, and reports 93.4% accuracy on a 3,000‑sample test set.

Baidu Geek Talk

Background

With the rapid growth of short‑video platforms, a large volume of user‑generated videos includes pornographic material that threatens internet safety and harms minors. Detecting such content is a multimodal problem, requiring both image and audio analysis because some videos appear benign visually but contain explicit audio.

Technical Challenges

Image‑based porn detection faces two main issues: the pornographic region often occupies a small portion of the frame, making recall difficult, and low‑quality or vulgar images can look visually similar to legitimate content. Audio‑based detection lacks established research, as existing speech classification methods are not directly applicable.

Proposed Framework

The solution consists of three components:

Pornographic image recognition model

Pornographic audio recognition model

Fusion of image and audio model outputs

1. Image Recognition Model

A custom network called DCNet is introduced, featuring parallel classification and detection branches to capture both global and local features. Two key optimizations are applied to the detection branch:

Feature fusion using BiFPN, which assigns adaptive weights to different feature maps and performs bidirectional fusion, improving detection performance.

An anchor‑free design based on fully convolutional networks (FCN) with a center‑point branch, enabling finer‑grained detection of small regions and reducing false positives.
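The adaptive weighting attributed to BiFPN above can be illustrated with a small sketch. It assumes the "fast normalized fusion" rule commonly associated with BiFPN, in which each input feature map gets a learnable scalar weight that is clipped to be non-negative and then normalized. The function name `fast_normalized_fusion` and the plain nested-list representation are illustrative choices, not taken from the article, and the article does not state which fusion variant DCNet uses.

```python
def fast_normalized_fusion(feature_maps, raw_weights, eps=1e-4):
    """Fuse same-shaped 2-D feature maps with non-negative, normalized weights.

    feature_maps: list of 2-D lists, all with identical shape (assumed to be
                  already resized to a common resolution).
    raw_weights:  one learnable scalar per feature map (hypothetical values here).
    """
    # ReLU keeps every weight non-negative so the normalization stays stable.
    weights = [max(0.0, w) for w in raw_weights]
    total = sum(weights) + eps  # eps avoids division by zero

    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    fused = [[0.0] * cols for _ in range(rows)]
    for fmap, w in zip(feature_maps, weights):
        share = w / total  # this map's share of the fused output
        for r in range(rows):
            for c in range(cols):
                fused[r][c] += share * fmap[r][c]
    return fused


# Example: two 1x2 maps with equal weights average out to roughly 2.0 per cell.
fused = fast_normalized_fusion([[[1.0, 1.0]], [[3.0, 3.0]]], [1.0, 1.0])
```

In a real network the weights would be trained parameters and the fused map would typically pass through a further convolution; both are omitted here to keep the weighting step visible.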

The architecture diagram (see image) illustrates the dual‑branch structure.
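To make the center‑point branch more concrete, here is a minimal sketch of a center-ness target as used in FCOS-style anchor‑free detectors. This is an assumption: the article only mentions a "center‑point branch" and does not give its formula. The function name `centerness` and the `(x_min, y_min, x_max, y_max)` box layout are illustrative.

```python
import math


def centerness(x, y, box):
    """Return a score in [0, 1] that is 1 at the box center and falls toward 0 at the edges.

    box is (x_min, y_min, x_max, y_max); this layout is an assumption for the sketch.
    """
    x_min, y_min, x_max, y_max = box
    left, right = x - x_min, x_max - x
    top, bottom = y - y_min, y_max - y
    # Points outside the box or exactly on its border get no center-ness.
    if min(left, right, top, bottom) <= 0:
        return 0.0
    horizontal = min(left, right) / max(left, right)
    vertical = min(top, bottom) / max(top, bottom)
    return math.sqrt(horizontal * vertical)


# Example: the exact center of a 10x10 box scores 1.0.
score = centerness(5, 5, (0, 0, 10, 10))
```

Down-weighting predictions made far from an object's center is one way such a branch can suppress low-quality boxes, which fits the stated goal of reducing false positives on small regions.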

2. Audio Recognition Model

Inspired by speech classification, the audio pipeline converts raw wav files into log Mel‑spectrograms, treating each second of audio as a 2‑D image. The model, named RANet, incorporates:
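As a rough illustration of what "log Mel" means, the sketch below shows the two pieces that distinguish a log Mel‑spectrogram from a plain spectrogram: mapping frequencies onto the mel scale and compressing energies logarithmically. It assumes the widely used HTK-style mel formula and a decibel-style log; the article does not state which variants, band count, or frequency range the authors used, so `n_mels`, `f_min`, and `f_max` are hypothetical parameters. The short-time Fourier transform and filterbank multiplication that come before these steps are omitted.

```python
import math


def hz_to_mel(hz):
    """Convert a frequency in Hz to mels (HTK-style formula, an assumption here)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_mels, f_min, f_max):
    """Center frequencies (Hz) of n_mels bands spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]


def log_compress(mel_energies, floor=1e-10):
    """Convert per-band energies to decibels, flooring tiny values to avoid log(0)."""
    return [10.0 * math.log10(max(e, floor)) for e in mel_energies]


# Example: band centers crowd together at low frequencies and spread out higher up.
centers = mel_band_centers(n_mels=8, f_min=0.0, f_max=8000.0)
```

In practice an audio library would normally handle the full conversion from waveform to log Mel‑spectrogram; the functions here only expose the scale and compression steps.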

Segmentation of the spectrogram into equal‑duration clips, processed by a Temporal Segment Network (TSN) to capture temporal dynamics.

A frequency‑attention module inserted at both ends of a ResNet backbone, consisting of two convolutional layers that highlight salient frequency components.
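The segment-based processing in the first item can be sketched as follows, assuming the usual Temporal Segment Network recipe: split the input into equal-length segments, score each segment with a shared classifier, then combine the per-segment scores with an averaging consensus. The helper names and the choice of averaging are assumptions; the article does not say how many segments RANet uses or how their scores are aggregated.

```python
def split_into_segments(frames, num_segments):
    """Split a list of spectrogram frames (time axis) into equal-length segments.

    Trailing frames that do not fill a whole segment are dropped.
    """
    seg_len = len(frames) // num_segments
    if seg_len == 0:
        raise ValueError("not enough frames for the requested number of segments")
    return [frames[i * seg_len:(i + 1) * seg_len] for i in range(num_segments)]


def segment_consensus(segment_scores):
    """Average per-segment class scores into one clip-level score vector."""
    n = len(segment_scores)
    num_classes = len(segment_scores[0])
    return [sum(scores[c] for scores in segment_scores) / n for c in range(num_classes)]


# Example: 10 frames split into 3 segments of 3 frames each (the last frame is dropped).
segments = split_into_segments(list(range(10)), 3)
# Hypothetical [normal, pornographic] scores from a shared per-segment classifier.
clip_scores = segment_consensus([[0.2, 0.8], [0.6, 0.4]])
```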

The corresponding architecture diagram is included.
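One way to picture a frequency‑attention module built from two convolutional layers is sketched below. Everything beyond "two convolutions that emphasize salient frequencies" is an assumption made for illustration: averaging over time to get a per-frequency profile, using 1‑D convolutions along the frequency axis, the ReLU and sigmoid activations, and the kernel values. The article does not describe RANet's module at this level of detail.

```python
import math


def conv1d_same(signal, kernel):
    """1-D cross-correlation with zero padding; output length equals input length.

    The kernel is assumed to have odd length.
    """
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out


def frequency_attention(spectrogram, kernel_a, kernel_b):
    """Rescale each frequency row of spectrogram[f][t] by a learned gate in (0, 1)."""
    # Summarize each frequency bin by its mean over time.
    profile = [sum(row) / len(row) for row in spectrogram]
    # First convolution + ReLU, then second convolution + sigmoid gives the gates.
    hidden = [max(0.0, v) for v in conv1d_same(profile, kernel_a)]
    gates = [1.0 / (1.0 + math.exp(-v)) for v in conv1d_same(hidden, kernel_b)]
    # Apply one gate per frequency bin across all time steps.
    return [[g * v for v in row] for g, row in zip(gates, spectrogram)]


# Example with hypothetical identity kernels: the high-energy middle bin gets the largest gate.
spec = [[1.0, 1.0], [4.0, 4.0], [1.0, 1.0]]
attended = frequency_attention(spec, [0.0, 1.0, 0.0], [0.0, 1.0, 0.0])
```

In a trained model the kernels would be learned, so the gates would reflect which frequency bands help separate the classes rather than simply which bands are loudest.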

3. Fusion Strategy

Outputs from the image and audio models are combined to produce a final classification, leveraging complementary cues from both modalities.
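The article does not specify how the two outputs are combined, so the following is only a sketch of two common late-fusion rules, not the authors' method. The "max" rule flags a video if either modality is confident, which suits the case raised in the Background where the visuals look benign but the audio is explicit. The "weighted" rule blends the two probabilities. The function names, the default threshold of 0.5, and the default weight are all hypothetical.

```python
def fuse_scores(image_prob, audio_prob, strategy="max", image_weight=0.5):
    """Combine per-modality pornographic probabilities into one score.

    strategy="max":      take the more confident modality (an OR-like rule).
    strategy="weighted": convex combination controlled by image_weight.
    """
    if strategy == "max":
        return max(image_prob, audio_prob)
    if strategy == "weighted":
        return image_weight * image_prob + (1.0 - image_weight) * audio_prob
    raise ValueError(f"unknown fusion strategy: {strategy}")


def is_flagged(image_prob, audio_prob, threshold=0.5, **fusion_kwargs):
    """Return True when the fused score reaches the decision threshold."""
    return fuse_scores(image_prob, audio_prob, **fusion_kwargs) >= threshold


# Example: benign-looking frames (0.1) with explicit audio (0.9) are flagged under the max rule.
flagged = is_flagged(0.1, 0.9)
```

The max rule favors recall at the cost of more false positives, while a weighted blend trades the other way; which is preferable depends on the moderation policy and would need to be tuned on validation data.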

Experimental Results

On a proprietary test set of 3,000 video samples, the integrated system achieved an accuracy of 93.4%. The result chart (see image) confirms the effectiveness of the multimodal approach.

Reference

The content is translated from a 2021 paper published in the open‑access journal Applied Sciences (doi: 10.3390/app11073066). Original article: https://www.mdpi.com/2076-3417/11/7/3066
