How a Lightweight Neural Network Cuts Transient Noise in Real‑Time Audio

NetEase Cloud Communication’s Audio Lab presents a low‑complexity neural‑network denoising algorithm that effectively suppresses both stationary and transient noises while preserving speech quality, detailing its mathematical model, feature design, loss function, GRU‑based architecture, real‑time performance, and comparative evaluation against state‑of‑the‑art methods.

NetEase Smart Enterprise Tech+
NetEase Cloud Communication Audio Lab has independently developed a lightweight neural network audio denoising algorithm (NetEase Cloud AI Audio Denoising) that achieves strong suppression of non‑stationary and transient noises while controlling speech signal distortion, ensuring high speech quality and intelligibility.

Background

Traditional signal‑processing based audio denoising works well for stationary noise but performs poorly on non‑stationary and transient noises, often damaging speech. Deep‑learning‑based denoising methods have emerged to address these shortcomings, yet they typically demand high computational resources.

Challenges

Neural‑network denoising faces high computational complexity, making real‑time execution on most devices (especially those without GPUs) difficult. Existing low‑overhead neural denoisers still exceed the computational budget of real‑time communication SDKs.

Proposed Solution

The lab’s algorithm targets transient noise with a lightweight network, achieving denoising performance comparable to traditional methods such as MMSE while maintaining very low computational cost. It has been integrated into NetEase’s next‑generation audio‑video SDK (NERTC) and runs on a wide range of devices, including many low‑end models.

Method

The problem is modeled with the standard additive framework: the noisy time-domain signal x(n) is the sum of clean speech s(n) and noise d(n), i.e. x(n) = s(n) + d(n). Applying the STFT gives the frequency-domain counterpart X(k, l) = S(k, l) + D(k, l), where k indexes the frequency bin and l the frame.

The clean speech is then estimated by applying a gain factor G(k, l) to each time-frequency bin of the noisy spectrum, Ŝ(k, l) = G(k, l) · X(k, l), and the enhanced waveform is recovered with the inverse STFT.
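Assuming the standard gain-based enhancement described here, the per-frame processing can be sketched in numpy as follows. The hand-set gain values are purely illustrative; in the article the gain comes from the neural network:

```python
import numpy as np

def denoise_frame(frame, gain):
    """Apply a per-bin spectral gain G(k) to one analysis frame:
    S_hat(k) = G(k) * X(k), then return to the time domain."""
    X = np.fft.rfft(frame)            # X(k): noisy frame spectrum
    S_hat = gain * X                  # S_hat(k) = G(k) * X(k)
    return np.fft.irfft(S_hat, n=len(frame))

# Toy example: a 160-sample frame (10 ms at 16 kHz) of a speech-band
# sinusoid plus white noise; a hand-set gain attenuates the upper bins.
rng = np.random.default_rng(0)
n = np.arange(160)
clean = np.sin(2 * np.pi * 440 * n / 16000)
noisy = clean + 0.3 * rng.standard_normal(160)
gain = np.ones(81)        # a 160-point rfft yields 81 bins
gain[40:] = 0.1           # crude illustration, not a learned gain
enhanced = denoise_frame(noisy, gain)
```

With a unit gain the frame passes through unchanged, which is why a well-estimated gain can suppress noise bins while leaving speech-dominated bins intact.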

Feature Representation

To keep computation low, the model is kept small and the input features are chosen to best distinguish speech from noise. Instead of the full magnitude and phase spectra, the study adopts pitch-correlation-based features and introduces a novel harmonic-correlation feature that captures inter-frame information, improving discrimination of transient noise.
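The article does not publish the exact feature definitions, but a pitch-correlation feature is commonly computed as the normalized autocorrelation of a frame over candidate pitch lags. A sketch under that assumption (the lag range and function name are illustrative, not from the source):

```python
import numpy as np

def pitch_correlation(frame, min_lag=32, max_lag=320):
    """Maximum normalized autocorrelation over candidate pitch lags
    (50-500 Hz at 16 kHz). Voiced speech is periodic and scores near 1;
    uncorrelated transient noise stays near 0."""
    frame = frame - frame.mean()
    best = 0.0
    for lag in range(min_lag, max_lag):
        a, b = frame[:-lag], frame[lag:]
        denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
        if denom > 0:
            best = max(best, np.dot(a, b) / denom)
    return best

rng = np.random.default_rng(1)
t = np.arange(640) / 16000
voiced = np.sin(2 * np.pi * 200 * t)   # periodic, pitch-like signal
click = rng.standard_normal(640)       # transient-like noise burst
```

The gap between the two scores is what makes such a feature useful for separating speech from transient noise.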

Loss Function

Following Valin's approach, the loss combines squared and fourth-power errors between the estimated gain and the ground-truth gain: the squared term encourages accurate convergence, while the fourth-power term sharpens penalties on large errors, helping the optimizer avoid local minima without letting the high-order term contribute excessive error.
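The summary does not give the weighting between the two terms; a hedged sketch of such a gain loss (the weight `lam` is an assumption) might look like:

```python
import numpy as np

def gain_loss(g_hat, g, lam=0.5):
    """Combined squared + fourth-power error between estimated and
    ground-truth per-bin gains; lam balances the two terms
    (the value here is assumed, not from the article)."""
    err = g_hat - g
    return np.mean(err ** 2) + lam * np.mean(err ** 4)

# Toy gains for four frequency bins.
g_true = np.array([1.0, 0.8, 0.2, 0.0])
g_pred = np.array([0.9, 0.7, 0.3, 0.1])
loss = gain_loss(g_pred, g_true)
```

For small errors the squared term dominates; as an error grows, the fourth-power term grows much faster and punishes the outlier.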

Model Architecture and Real‑Time Processing

The system uses an RNN‑GRU model because recurrent networks retain temporal information crucial for speech. After training, the model is embedded in the NetEase SDK, which extracts features from audio buffers, feeds them to the network, obtains gain values, and applies them to the signal.
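The summary gives no layer sizes, but the reason a GRU suits this task can be seen from a single recurrent step: its gates decide how much of the previous frame's state carries into the current frame. A numpy sketch with the standard GRU equations (all dimensions and weights here are illustrative, not from the article):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, W, U, b):
    """One GRU update: update gate z and reset gate r control how much
    of the previous hidden state h survives into the new state."""
    Wz, Wr, Wh = W
    Uz, Ur, Uh = U
    bz, br, bh = b
    z = sigmoid(x @ Wz + h @ Uz + bz)
    r = sigmoid(x @ Wr + h @ Ur + br)
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh + bh)
    return (1 - z) * h + z * h_tilde

rng = np.random.default_rng(0)
dim_in, dim_h = 8, 16              # sizes assumed, not from the article
W = [0.1 * rng.standard_normal((dim_in, dim_h)) for _ in range(3)]
U = [0.1 * rng.standard_normal((dim_h, dim_h)) for _ in range(3)]
b = [np.zeros(dim_h) for _ in range(3)]

h = np.zeros(dim_h)
for _ in range(5):                 # stream five feature frames
    x = rng.standard_normal(dim_in)
    h = gru_step(x, h, W, U, b)
```

Because each output is a gated blend of the old state and a bounded candidate, the state stays numerically stable while accumulating temporal context frame by frame.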

Evaluation and Discussion

The algorithm was tested on a separate dataset and compared with traditional MMSE, RNNoise, DNS‑Net, and DTLN. Results show that the proposed method offers comparable or better speech quality (STOI, MOS) while maintaining lower computational load.

In keyboard‑noise scenarios (a typical transient noise), the AI denoiser almost completely removes the noise in non‑speech regions and significantly attenuates it in speech regions while preserving intelligibility.

Quantitative tables (omitted for brevity) demonstrate that the proposed features and loss function yield the best trade‑off between noise reduction and speech quality.

Performance

The AI denoiser processes 10 ms audio frames (16 kHz) with roughly 400 k floating‑point operations. Using NetEase’s NENN inference engine, it runs on an iPhone 12 in under 0.01 ms per frame, with CPU usage below 0.02%.
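Those figures can be sanity-checked with simple arithmetic: 400 k operations every 10 ms is a sustained rate of 40 MFLOPS, and 0.01 ms of compute per 10 ms frame is a real-time factor of 0.001:

```python
# Figures taken from the article's performance section.
flops_per_frame = 400_000        # ~400 k floating-point ops per frame
frame_duration_s = 0.010         # 10 ms frames at 16 kHz
compute_time_s = 0.01e-3         # < 0.01 ms per frame on iPhone 12

sustained_mflops = flops_per_frame / frame_duration_s / 1e6
real_time_factor = compute_time_s / frame_duration_s
```

A real-time factor of 0.001 means the denoiser uses roughly a thousandth of the available time budget, consistent with the reported sub-0.02% CPU usage.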

Conclusion

The study delivers a lightweight, real‑time neural network audio denoising solution that effectively handles stationary, non‑stationary, and transient noises while preserving speech quality, offering a competitive edge over existing AI denoisers.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: neural network, real-time processing, speech enhancement, low-complexity, audio denoising, transient noise
Written by NetEase Smart Enterprise Tech+