How Deep Learning Detects Pornographic and ASMR Audio
This article explains a deep‑learning pipeline that preprocesses audio, extracts FBank features, applies SpecAugment, and uses a CNN‑BI‑LSTM‑Attention model to automatically identify pornographic and ASMR speech for content moderation.
Problem Description
Under our business requirements, pornographic speech and ASMR speech are prohibited content that must be blocked automatically. The task is to use a deep-learning model to recognize both audio types in large volumes of incoming voice streams. Here, pornographic speech refers to moaning sounds from male or female speakers. ASMR stands for "autonomous sensory meridian response", a pleasant, tingling sensation triggered by auditory, visual, or tactile stimuli; ASMR speech is audio produced to trigger that response.
Solution and System Architecture
We build a deep neural network composed of convolutional layers, bidirectional LSTM layers, and attention mechanisms. After training, the model parameters are fixed for inference. The system consists of three main modules: data preprocessing, the deep neural network, and loss‑function design.
Data Preprocessing
Data preprocessing prepares audio for the neural network and differs slightly between training and prediction phases.
Training stage: FBank acoustic feature extraction, followed by data augmentation (SpecAugment).
Prediction stage: only FBank extraction.
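FBank (log Mel filterbank) features are spectral features computed on the Mel scale, which approximates human auditory perception. They are widely used as input to neural-network speech models. Unlike MFCCs, they skip the final discrete cosine transform, so they are not cepstral features.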
FBank Feature Extraction
Extraction proceeds in four steps: framing and windowing, a Fourier transform of each frame, Mel filterbank filtering, and a logarithm.
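The four steps can be sketched in NumPy as follows. This is a minimal illustration, not the pipeline's actual code: the function names, the 16 kHz sample rate, the 25 ms / 10 ms frame settings, the 512-point FFT, and the 40 Mel bands are all assumptions chosen because they are common defaults.

```python
import numpy as np

def hz_to_mel(hz):
    # Common Mel-scale formula (O'Shaughnessy variant).
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters spaced evenly on the Mel scale.

    Returns an array of shape (n_mels, n_fft // 2 + 1).
    """
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        left, center, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def fbank(signal, sample_rate=16000, frame_len=0.025, frame_shift=0.010,
          n_fft=512, n_mels=40):
    """Log Mel filterbank features, shape (n_frames, n_mels).

    All default values are illustrative assumptions; n_fft must be at
    least as large as the frame length in samples.
    """
    signal = np.asarray(signal, dtype=float)
    win = int(round(frame_len * sample_rate))
    hop = int(round(frame_shift * sample_rate))
    if len(signal) < win:
        signal = np.pad(signal, (0, win - len(signal)))

    # 1. Framing and windowing (Hamming window).
    n_frames = 1 + (len(signal) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(win)

    # 2. Fourier transform -> power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft

    # 3. Mel filterbank: weighted sum of power inside each triangular band.
    mel_energies = power @ mel_filterbank(n_mels, n_fft, sample_rate).T

    # 4. Logarithm (small floor avoids log(0) on silent frames).
    return np.log(mel_energies + 1e-10)
```

For one second of 16 kHz audio, `fbank(signal)` returns a matrix with one row per 10 ms frame and one column per Mel band, which is the layout assumed for the model input.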
SpecAugment Data Augmentation
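SpecAugment, proposed by Google, augments the spectrogram directly rather than the raw waveform: it warps the spectrogram along the time axis and masks blocks of consecutive frequency channels and consecutive time steps. This makes the model more robust to temporal deformation and to partial loss of frequency or time information.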
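A minimal sketch of the two masking operations is shown below, assuming features laid out as (frames, Mel bands). The function name, the mask counts and widths, and the choice to fill masked regions with the feature mean are illustrative assumptions; time warping is omitted to keep the example short.

```python
import numpy as np

def spec_augment(features, num_freq_masks=2, max_freq_width=8,
                 num_time_masks=2, max_time_width=20, rng=None):
    """Apply frequency and time masking to a (n_frames, n_bins) feature matrix.

    Returns a new array; the input is left unchanged. All default values
    are hypothetical and would normally be tuned on validation data.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = np.array(features, dtype=float, copy=True)
    n_frames, n_bins = out.shape
    fill = out.mean()  # assumption: mask with the global mean of the features

    # Frequency masking: blank out a band of consecutive Mel channels.
    for _ in range(num_freq_masks):
        width = min(int(rng.integers(0, max_freq_width + 1)), n_bins)
        start = int(rng.integers(0, n_bins - width + 1))
        out[:, start:start + width] = fill

    # Time masking: blank out a run of consecutive frames.
    for _ in range(num_time_masks):
        width = min(int(rng.integers(0, max_time_width + 1)), n_frames)
        start = int(rng.integers(0, n_frames - width + 1))
        out[start:start + width, :] = fill

    return out
```

Because the masks are drawn at random on every call, the same utterance yields a different augmented spectrogram in each training epoch. At prediction time the function is simply not called.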
Deep Neural Network Model Design
After preprocessing, the audio is represented by FBank features, which pass through three stages:
CNN: further extracts local patterns from the FBank features, emphasizing segments with strong prohibited cues.
BI-LSTM: captures long-range temporal dependencies and contextual information, which helps differentiate pornographic speech (often recorded in quiet rooms) from background noise.
Attention: focuses the model on the most discriminative features for classification.
A final fully connected layer followed by Softmax outputs the predicted class probabilities.
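The three stages and the output layer could be wired together in PyTorch roughly as follows. This is a sketch under stated assumptions, not the production network: the class name, the single convolution block, the layer sizes, and the three output classes (normal, pornographic, ASMR) are all hypothetical, since the article does not give these details.

```python
import torch
import torch.nn as nn

class CnnBiLstmAttention(nn.Module):
    """Illustrative CNN -> BI-LSTM -> attention pooling -> classifier."""

    def __init__(self, n_mels=40, n_classes=3, conv_channels=32, lstm_hidden=128):
        super().__init__()
        # CNN: local time-frequency patterns. Pooling is applied along the
        # frequency axis only, so the number of frames is preserved.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, conv_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(conv_channels),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 2)),
        )
        # BI-LSTM: context from both past and future frames.
        self.lstm = nn.LSTM(
            input_size=conv_channels * (n_mels // 2),
            hidden_size=lstm_hidden,
            batch_first=True,
            bidirectional=True,
        )
        # Attention: one scalar score per frame.
        self.attn_score = nn.Linear(2 * lstm_hidden, 1)
        # Fully connected output layer (returns logits).
        self.classifier = nn.Linear(2 * lstm_hidden, n_classes)

    def forward(self, fbank):
        # fbank: (batch, frames, n_mels)
        x = self.cnn(fbank.unsqueeze(1))        # (batch, C, frames, n_mels // 2)
        x = x.permute(0, 2, 1, 3).flatten(2)    # (batch, frames, C * n_mels // 2)
        x, _ = self.lstm(x)                     # (batch, frames, 2 * hidden)
        weights = torch.softmax(self.attn_score(x), dim=1)  # (batch, frames, 1)
        pooled = (weights * x).sum(dim=1)       # weighted average over frames
        return self.classifier(pooled)          # (batch, n_classes) logits
```

The model returns logits rather than probabilities, so `torch.softmax(logits, dim=-1)` is applied at prediction time. Keeping Softmax outside the module is a design choice that lets training use a loss such as `nn.CrossEntropyLoss`, which expects raw logits.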
Conclusion
This article presented a deep-learning solution for detecting pornographic and ASMR audio, covering data preprocessing and the model architecture. The technique is documented in a patent filing and is intended to block prohibited audio content efficiently.
NetEase Smart Enterprise Tech+