How End-to-End Deep Learning Boosts Real-Time Speech Enhancement
An end‑to‑end deep‑learning framework for speech enhancement is presented, detailing dataset creation, time‑domain feature extraction, a convolutional separation network, decoding, and training strategies using SI‑SIR loss with PIT, achieving a final SI‑SIR of 13 dB.
Background
Speech enhancement refers to extracting useful speech signals from noisy backgrounds, suppressing noise to improve quality and intelligibility. Deep learning is a primary method for achieving this, with applications in voice chat and real‑time communication.
Overall Architecture
Overall architecture summary:
End‑to‑end training framework consisting of three parts: feature extraction (encoding), separation network, and decoding.
Feature extraction: the noisy signal is encoded directly from the raw time-domain waveform by a linear convolutional encoder.
Separation network: fully convolutional architecture generating embeddings for speech and noise.
Decoding: linear decoding maps encoded signals back to time domain.
Detailed Design
Detailed scheme includes dataset creation, feature extraction (encoding module), separation network, decoding module, and loss‑function training strategy.
Dataset Creation
Clear speech data: ~20,000 4‑second clips (10k Chinese, 10k English).
Noise data: ~3,000 4‑second clips of various noises.
Speech and noise are mixed at SNRs ranging from −10 dB to 10 dB.
Data augmentation doubles the dataset.
Total duration: ~44 hours.
Data dimension: (32000, 1), i.e., a 4-second clip at an 8 kHz sampling rate.
Stored as TFRecord to accelerate training.
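The SNR-controlled mixing step above can be sketched in NumPy; the `mix_at_snr` helper and the random stand-in clips are illustrative assumptions, not the article's actual pipeline:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the speech-to-noise power ratio equals snr_db, then mix."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale factor that brings the noise to the desired SNR relative to the speech.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(32000)   # stand-in for a 4 s clip at 8 kHz
noise = rng.standard_normal(32000)    # stand-in for a 4 s noise clip
mixture = mix_at_snr(speech, noise, snr_db=0)
```

Sampling `snr_db` uniformly from [−10, 10] per clip would reproduce the stated mixing range.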
Feature Extraction
Frequency‑Domain Features and Limitations
Traditional deep‑learning inputs use frequency‑domain features (STFT, MFCC, …) with limitations:
STFT accuracy varies with window size; small windows give precise time but poor frequency resolution.
Large windows improve frequency resolution but lose temporal detail.
Common window lengths (e.g., 512 or 256 samples) therefore trade frequency resolution against algorithmic latency.
Some networks separate only the magnitude spectrum in the frequency domain and reuse the noisy phase for reconstruction, which introduces errors.
Time‑Domain Feature Extraction and Advantages
The encoder is a linear converter: a 1-D convolution with filter length 16, stride 8, and 512 channels, transforming each 16-sample frame into a 512-dimensional vector, so a 32000-sample input becomes a 3999×512 feature map.
The effective window length is under 4 ms at 8 kHz sampling (16 samples ≈ 2 ms), yielding low latency.
Using time‑domain input avoids phase issues.
Encoder filters learn to encode different frequencies; low‑frequency weights are higher, matching human voice characteristics.
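The encoder's framing arithmetic can be checked with a small NumPy sketch; the `encode` helper and the random basis matrix are stand-ins for the learned filters, not the trained model:

```python
import numpy as np

# Encoder settings stated in the text: filter length 16, stride 8, 512 channels.
FILTER_LEN, STRIDE, CHANNELS = 16, 8, 512

def encode(signal, basis):
    """Linear time-domain encoding: frame the signal with a hop of 8 samples
    and project each 16-sample frame onto 512 basis filters."""
    n_frames = (len(signal) - FILTER_LEN) // STRIDE + 1
    frames = np.stack([signal[i * STRIDE : i * STRIDE + FILTER_LEN]
                       for i in range(n_frames)])
    return frames @ basis.T            # shape: (n_frames, 512)

rng = np.random.default_rng(0)
basis = rng.standard_normal((CHANNELS, FILTER_LEN))  # stand-in for learned filters
features = encode(rng.standard_normal(32000), basis)
print(features.shape)  # (3999, 512)
```

The frame count (32000 − 16) / 8 + 1 = 3999 matches the 3999×512 feature map described above.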
Separation Network
Encoder outputs feed a separation network that produces two masks, one estimating the speech component and one the noise component.
Network details:
Architecture similar to WaveNet, composed of stacked dilated 1-D CNN layers. Example: the first layer sees three adjacent inputs (dilation 1), the second uses dilation 2, and the third dilation 4, rapidly expanding the receptive field.
Convolution types: non-causal (uses future frames, better denoising) and causal (uses only past and current frames, lower latency for real-time use).
Three repetitions of the CNN block allow the model to see up to 1.53 s of waveform; depthwise separable convolutions reduce parameters.
1×1 convolution and ReLU generate two masks, which multiply with encoder output and feed the decoder.
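The 1.53 s receptive field can be reproduced with simple arithmetic under Conv-TasNet-style assumptions (kernel size 3, dilations doubling over 8 layers per block, 3 repeated blocks); the text states only the final figure, so the exact layer counts here are an assumption:

```python
# Receptive-field arithmetic for the dilated 1-D conv stack described above.
KERNEL, N_LAYERS, N_REPEATS = 3, 8, 3          # assumed Conv-TasNet-style settings
STRIDE, FILTER_LEN, SAMPLE_RATE = 8, 16, 8000  # encoder settings from the text

rf_frames = 1
for _ in range(N_REPEATS):
    for layer in range(N_LAYERS):
        dilation = 2 ** layer
        # Each kernel-3 layer widens the receptive field by 2 * dilation frames.
        rf_frames += (KERNEL - 1) * dilation

# Convert the frame-level receptive field back to raw samples and seconds.
rf_samples = (rf_frames - 1) * STRIDE + FILTER_LEN
rf_seconds = rf_samples / SAMPLE_RATE
print(rf_frames, rf_samples, round(rf_seconds, 2))  # 1531 12256 1.53
```

Under these assumptions the model sees 12256 samples, i.e., about 1.53 s at 8 kHz, matching the figure quoted above.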
Decoding Module
Decoding linearly transforms the separation network output into a (3999, 16) matrix.
Decoder filters show frequency characteristics similar to the encoder's, emphasizing the low frequencies typical of the human voice.
Overlap‑add method reconstructs separated speech and noise signals of dimension (32000,1).
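The overlap-add reconstruction can be sketched as follows; `overlap_add` is an illustrative helper matching the stated segment length (16) and stride (8), not the article's implementation:

```python
import numpy as np

def overlap_add(segments, stride=8):
    """Reconstruct a waveform from overlapping decoded segments by summing
    each 16-sample segment into the output at its frame offset."""
    n_frames, seg_len = segments.shape
    out = np.zeros((n_frames - 1) * stride + seg_len)
    for i, seg in enumerate(segments):
        out[i * stride : i * stride + seg_len] += seg
    return out

decoded = np.ones((3999, 16))   # stand-in for the decoder's (3999, 16) output
signal = overlap_add(decoded)
print(signal.shape)  # (32000,)
```

With 3999 frames, stride 8, and segment length 16, the output length (3999 − 1) × 8 + 16 = 32000 matches the stated (32000, 1) dimension.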
Loss Function and Training Strategy
Loss uses scale‑invariant signal‑to‑interference ratio (SI‑SIR) and training employs permutation invariant training (PIT).
SI‑SIR definition is illustrated in the figure.
The PIT strategy evaluates both possible assignments of outputs to targets and selects the one with the lower loss, giving the network a more accurate training signal.
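A minimal NumPy sketch of the loss, assuming an SI-SNR-style definition of SI-SIR (projection of the estimate onto the target, compared against the residual); the function names and the toy swapped-output example are illustrative:

```python
import numpy as np

def si_sir(est, target):
    """Scale-invariant SIR in dB: project the zero-mean estimate onto the
    target, then compare the target-aligned part to the residual."""
    target_zm = target - target.mean()
    est_zm = est - est.mean()
    s_target = (est_zm @ target_zm) / (target_zm @ target_zm) * target_zm
    e_noise = est_zm - s_target
    return 10 * np.log10((s_target @ s_target) / (e_noise @ e_noise))

def pit_loss(est1, est2, ref1, ref2):
    """Two-output permutation invariant training: evaluate both
    output-to-target assignments and keep the better (lower) loss."""
    loss_a = -(si_sir(est1, ref1) + si_sir(est2, ref2)) / 2
    loss_b = -(si_sir(est1, ref2) + si_sir(est2, ref1)) / 2
    return min(loss_a, loss_b)

rng = np.random.default_rng(0)
speech, noise = rng.standard_normal(32000), rng.standard_normal(32000)
# Even if the outputs come back in swapped order, PIT picks the matching
# assignment, so the loss stays low (strongly negative in dB terms).
loss = pit_loss(noise + 0.01 * speech, speech + 0.01 * noise, speech, noise)
```

Negating the mean SI-SIR turns "higher SIR is better" into a minimizable loss.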
Model Performance
The final model achieves an SI‑SIR of 13 dB.
Douyu Streaming
Official account of Douyu Streaming Development Department, sharing audio and video technology best practices.
