How Deep Learning Detects Pornographic and ASMR Audio

This article explains a deep‑learning pipeline that preprocesses audio, extracts FBank features, applies SpecAugment, and uses a CNN‑BI‑LSTM‑Attention model to automatically identify pornographic and ASMR speech for content moderation.

NetEase Smart Enterprise Tech+

Feb 23, 2021

Problem Description

According to business requirements, pornographic speech and ASMR speech are prohibited content that must be blocked automatically. The task is to use deep‑learning models to recognize these audio types in massive incoming voice streams. Pornographic speech refers to male and female moaning sounds. ASMR speech is audio intended to trigger an autonomous sensory meridian response (ASMR), a pleasant, tingling sensation that can be set off by auditory, visual, or tactile stimuli.

Solution and System Architecture

We build a deep neural network composed of convolutional layers, bidirectional LSTM layers, and attention mechanisms. After training, the model parameters are fixed for inference. The system consists of three main modules: data preprocessing, the deep neural network, and loss‑function design.

Data Preprocessing

Data preprocessing prepares audio for the neural network and differs slightly between training and prediction phases.

Training stage: acoustic feature FBank extraction and data augmentation (SpecAugment).

Prediction stage: only FBank extraction.

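FBank (log Mel‑filterbank) features are computed by passing each frame's power spectrum through filters spaced on the Mel scale, which approximates human auditory perception. Unlike MFCCs, they skip the final discrete cosine transform, so they are spectral rather than cepstral features. They are among the most common and effective acoustic features for speech tasks.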

FBank Feature Extraction

The process includes framing and windowing, Fourier transform, Mel‑filterbank, and logarithm operations.
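The four steps above can be sketched in NumPy as follows. This is a minimal illustration, not the article's actual implementation: the 16 kHz sample rate, 25 ms frames, 10 ms hop, 512‑point FFT, 40 Mel bands, and the Hamming window are all assumed values, and the helper names are hypothetical.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular Mel filters, shape (n_mels, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        left, center, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def fbank(signal, sample_rate=16000, frame_ms=25, hop_ms=10, n_fft=512, n_mels=40):
    """Return log Mel-filterbank features, shape (n_frames, n_mels)."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    # 1. Framing and windowing (Hamming window is an assumed choice)
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # 2. Fourier transform -> power spectrum per frame
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft
    # 3. Mel filterbank energies
    mel_energy = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    # 4. Logarithm (small constant avoids log(0))
    return np.log(mel_energy + 1e-10)
```

With these assumed settings, one second of 16 kHz audio yields a feature matrix of 98 frames by 40 Mel bands.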

SpecAugment Data Augmentation

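SpecAugment, proposed by Google, operates directly on the spectrogram features rather than on the raw waveform: it warps the features along the time axis and masks blocks of consecutive frequency channels and time steps. This improves model robustness against temporal deformations and against partial loss of frequency content or short speech segments.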
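A simplified sketch of the masking part of SpecAugment is shown below, assuming the input is an FBank matrix of shape (frames, Mel bands). The mask counts and maximum widths are illustrative assumptions, masked cells are filled with the feature mean (an assumed choice), and the time‑warping step is omitted for brevity. The function name `spec_augment` is hypothetical.

```python
import numpy as np

def spec_augment(features, num_freq_masks=2, max_freq_width=8,
                 num_time_masks=2, max_time_width=20, rng=None):
    """Return a copy of `features` (n_frames, n_mels) with random
    frequency and time masks applied. Time warping is not implemented."""
    rng = np.random.default_rng() if rng is None else rng
    out = features.copy()
    n_frames, n_mels = out.shape
    fill = out.mean()  # assumed fill value for masked regions

    # Frequency masking: blank out a band of consecutive Mel channels
    for _ in range(num_freq_masks):
        width = int(rng.integers(0, max_freq_width + 1))
        start = int(rng.integers(0, max(1, n_mels - width + 1)))
        out[:, start:start + width] = fill

    # Time masking: blank out a run of consecutive frames
    for _ in range(num_time_masks):
        width = int(rng.integers(0, max_time_width + 1))
        start = int(rng.integers(0, max(1, n_frames - width + 1)))
        out[start:start + width, :] = fill

    return out
```

In line with the training and prediction split described earlier, a function like this would be applied only to training batches; at prediction time the FBank features would be passed to the model unchanged.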

Deep Neural Network Model Design

After preprocessing, the audio is represented by FBank features, which pass through three stages:

CNN: further extracts local patterns from FBank features, emphasizing segments with strong prohibited cues.

BI‑LSTM: captures long‑range temporal dependencies and contextual information, which helps differentiate pornographic speech (often recorded in quiet rooms) from background noise.

Attention: focuses the model on the most discriminative features for classification.

The final fully‑connected layer and Softmax output the predicted class.
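The last two steps, attention followed by a fully‑connected layer and Softmax, can be illustrated with the NumPy sketch below. The article does not specify the attention variant, so additive attention pooling over the BI‑LSTM outputs is assumed here; the weight names, dimensions, and a three‑class output (for example normal, pornographic, ASMR) are likewise assumptions, and random matrices stand in for trained parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_classify(hidden, w_att, v_att, w_out, b_out):
    """hidden: (n_frames, d) sequence of BI-LSTM outputs (assumed input).
    Returns class probabilities and per-frame attention weights."""
    scores = np.tanh(hidden @ w_att) @ v_att  # one relevance score per frame
    alpha = softmax(scores)                   # attention weights, sum to 1
    context = alpha @ hidden                  # weighted sum of frames, shape (d,)
    logits = context @ w_out + b_out          # fully-connected layer
    return softmax(logits), alpha

# Usage with random stand-in weights (hypothetical sizes)
rng = np.random.default_rng(0)
hidden = rng.normal(size=(98, 64))            # 98 frames, 64-dim BI-LSTM output
probs, alpha = attention_classify(
    hidden,
    w_att=rng.normal(size=(64, 32)),
    v_att=rng.normal(size=32),
    w_out=rng.normal(size=(64, 3)),
    b_out=np.zeros(3),
)
```

The attention weights `alpha` indicate which frames contribute most to the decision, which is the sense in which the attention stage focuses the model on discriminative segments.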

Conclusion

The article presented a deep‑learning solution for detecting pornographic and ASMR audio, covering data preprocessing and model architecture. The technology has been documented in a patent filing and aims to efficiently combat prohibited audio content.
