How Deep Learning Detects Pornographic and ASMR Audio

This article explains a deep‑learning pipeline that preprocesses audio, extracts FBank features, applies SpecAugment, and uses a CNN‑BI‑LSTM‑Attention model to automatically identify pornographic and ASMR speech for content moderation.

NetEase Smart Enterprise Tech+

Problem Description

Under our business requirements, pornographic speech and ASMR speech are prohibited content that must be blocked automatically. The task is to use a deep‑learning model to recognize these two audio types in a massive stream of incoming voice data. Here, pornographic speech means moaning sounds from male or female voices. ASMR stands for "autonomous sensory meridian response", a pleasant, tingling sensation triggered by auditory, visual, or tactile stimuli; ASMR speech is audio intended to trigger that response.

Solution and System Architecture

We build a deep neural network composed of convolutional layers, bidirectional LSTM layers, and attention mechanisms. After training, the model parameters are fixed for inference. The system consists of three main modules: data preprocessing, the deep neural network, and loss‑function design.

Solution diagram

Data Preprocessing

Data preprocessing prepares audio for the neural network and differs slightly between training and prediction phases.

Training stage: acoustic feature FBank extraction and data augmentation (SpecAugment).

Prediction stage: only FBank extraction.
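
The difference between the two phases can be sketched as a single entry point with a training flag. This is a minimal illustration, not the authors' code: `preprocess`, `extract_features`, and `augment` are hypothetical names, and the two callables stand in for FBank extraction and SpecAugment.

```python
from typing import Any, Callable, Optional


def preprocess(
    waveform: Any,
    extract_features: Callable[[Any], Any],
    augment: Optional[Callable[[Any], Any]] = None,
    training: bool = False,
) -> Any:
    """Turn a waveform into model input.

    Feature extraction always runs. Augmentation runs only when
    training is True, so prediction sees unmodified features.
    """
    features = extract_features(waveform)
    if training and augment is not None:
        features = augment(features)
    return features
```

Keeping augmentation behind the flag means the same function serves both phases, and inference results stay deterministic for a given input.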

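FBank (log Mel filterbank) features are spectral features computed on the Mel scale, which approximates human auditory perception. They are among the most widely used acoustic features for speech tasks. Cepstral features such as MFCC are derived from FBank by an additional discrete cosine transform.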

FBank Feature Extraction

Extraction proceeds in four steps: framing and windowing, a Fourier transform of each frame, Mel filterbank filtering, and a logarithm.
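
The four steps can be sketched in NumPy as follows. The article does not give its parameters, so everything numeric here is an assumption: 16 kHz audio, 25 ms frames with a 10 ms hop, a Hamming window, a 512-point FFT, and 40 Mel filters. The function names are also illustrative. Pre-emphasis is left out because the article does not list it.

```python
import numpy as np


def hz_to_mel(hz):
    """Convert frequency in Hz to the Mel scale."""
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)


def mel_to_hz(mel):
    """Convert Mel-scale values back to Hz."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters spaced evenly on the Mel scale.

    Returns an array of shape (n_mels, n_fft // 2 + 1).
    """
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    bank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank


def fbank(signal, sample_rate=16000, frame_ms=25, hop_ms=10, n_fft=512, n_mels=40):
    """Compute log Mel filterbank features of shape (n_frames, n_mels)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:  # pad very short clips to one full frame
        signal = np.pad(signal, (0, frame_len - len(signal)))

    # 1. Framing and windowing
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)

    # 2. Fourier transform -> power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft

    # 3. Mel filterbank
    mel_energy = power @ mel_filterbank(n_mels, n_fft, sample_rate).T

    # 4. Logarithm (small constant avoids log(0) on silent frames)
    return np.log(mel_energy + 1e-10)
```

With these assumed settings, one second of 16 kHz audio yields a feature matrix of 98 frames by 40 Mel bins.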

FBank extraction flow

SpecAugment Data Augmentation

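SpecAugment, proposed by Google, augments the spectrogram directly rather than the raw waveform: it warps the features along the time axis and masks blocks of consecutive frequency channels and consecutive time steps. This improves the model's robustness to temporal deformation and to partial loss of frequency or time information.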
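
The two masking operations can be sketched in NumPy as below. This is a simplified illustration under stated assumptions: time warping is omitted, masked cells are filled with the feature mean, and the mask counts and maximum widths are placeholder values rather than settings from the article. The function name `spec_augment` is illustrative.

```python
import numpy as np


def spec_augment(features, num_freq_masks=2, max_freq_width=8,
                 num_time_masks=2, max_time_width=20, rng=None):
    """Apply frequency and time masking to a (frames, mel_bins) array.

    Returns a new array; the input is not modified.
    """
    rng = np.random.default_rng() if rng is None else rng
    augmented = np.array(features, dtype=float, copy=True)
    n_frames, n_bins = augmented.shape
    fill = augmented.mean()  # assumption: mask with the mean value

    # Frequency masking: blank a random band of consecutive Mel bins.
    for _ in range(num_freq_masks):
        width = int(rng.integers(0, min(max_freq_width, n_bins) + 1))
        start = int(rng.integers(0, n_bins - width + 1))
        augmented[:, start:start + width] = fill

    # Time masking: blank a random run of consecutive frames.
    for _ in range(num_time_masks):
        width = int(rng.integers(0, min(max_time_width, n_frames) + 1))
        start = int(rng.integers(0, n_frames - width + 1))
        augmented[start:start + width, :] = fill

    return augmented
```

Because the masks are redrawn on every call, the same training clip yields a different augmented view in each epoch.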

SpecAugment example

Deep Neural Network Model Design

After preprocessing, the audio is represented by FBank features, which pass through three stages:

CNN: further extracts local patterns from FBank features, emphasizing segments with strong prohibited cues.

BI‑LSTM: captures long‑range temporal dependencies and contextual information, which helps differentiate pornographic speech (often recorded in quiet rooms) from background noise.

Attention: focuses the model on the most discriminative features for classification.

A final fully‑connected layer followed by Softmax outputs the predicted class.
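
The last two stages can be sketched in NumPy as follows. The article does not specify the form of attention, so this sketch assumes additive attention pooling over the BI‑LSTM outputs; the function and parameter names, the dimensions, and the choice of three output classes (for example normal, pornographic, ASMR) are all assumptions for illustration. The weights would be learned in training; here they are simply passed in.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attention_classify(hidden, w_att, v_att, w_out, b_out):
    """Pool frame-level features with attention, then classify.

    hidden: (frames, dim) sequence of recurrent-layer outputs
    w_att:  (dim, att_dim) attention projection
    v_att:  (att_dim,) attention scoring vector
    w_out:  (dim, classes) fully-connected weights
    b_out:  (classes,) fully-connected bias
    Returns (class_probabilities, frame_weights).
    """
    scores = np.tanh(hidden @ w_att) @ v_att   # one score per frame
    weights = softmax(scores)                  # non-negative, sums to 1
    context = weights @ hidden                 # weighted average over frames
    logits = context @ w_out + b_out           # fully-connected layer
    return softmax(logits), weights
```

The returned frame weights show which parts of a clip drove the decision, which can help when reviewing flagged audio.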

Audio classification model diagram

Conclusion

This article presented a deep‑learning solution for detecting pornographic and ASMR audio, covering data preprocessing and model architecture. The technology has been documented in a patent filing and aims to efficiently combat prohibited audio content.
