How End-to-End Deep Learning Boosts Real-Time Speech Enhancement
An end‑to‑end deep‑learning framework for speech enhancement is presented, detailing dataset creation, time‑domain feature extraction, a convolutional separation network, decoding, and training strategies using SI‑SIR loss with PIT, achieving a final SI‑SIR of 13 dB.
Background
Speech enhancement refers to extracting useful speech signals from noisy backgrounds, suppressing noise to improve quality and intelligibility. Deep learning is a primary method for achieving this, with applications in voice chat and real‑time communication.
Overall Architecture
Overall architecture summary:
End‑to‑end training framework consisting of three parts: feature extraction (encoding), separation network, and decoding.
Feature extraction: the noisy signal is encoded directly from the raw time-domain waveform by a linear convolutional encoder.
Separation network: fully convolutional architecture generating embeddings for speech and noise.
Decoding: linear decoding maps encoded signals back to time domain.
Detailed Design
Detailed scheme includes dataset creation, feature extraction (encoding module), separation network, decoding module, and loss‑function training strategy.
Dataset Creation
Clear speech data: ~20,000 4‑second clips (10k Chinese, 10k English).
Noise data: ~3,000 4‑second clips of various noises.
Speech and noise are mixed at SNRs ranging from −10 dB to 10 dB.
Data augmentation doubles the dataset.
Total duration: ~44 hours.
Data dimension: (32000, 1), i.e., a 4-second clip at an 8 kHz sampling rate.
Stored as TFRecord to accelerate training.
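The SNR-controlled mixing step above can be sketched in NumPy; the `mix_at_snr` helper and the random stand-in clips are illustrative assumptions, not the article's actual pipeline:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the speech-to-noise power ratio equals snr_db, then mix."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale factor that brings the noise to the desired SNR relative to the speech.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(32000)   # stand-in for a 4 s clip at 8 kHz
noise = rng.standard_normal(32000)    # stand-in for a 4 s noise clip
mixture = mix_at_snr(speech, noise, snr_db=0)
```

Sampling `snr_db` uniformly from [−10, 10] per clip would reproduce the stated mixing range.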
Feature Extraction
Frequency‑Domain Features and Limitations
Traditional deep‑learning inputs use frequency‑domain features (STFT, MFCC, …) with limitations:
STFT accuracy varies with window size; small windows give precise time but poor frequency resolution.
Large windows improve frequency resolution but lose temporal detail.
Common window lengths (e.g., 512 or 256 samples) therefore trade frequency resolution against algorithmic latency.
Some networks separate only the magnitude spectrum in the frequency domain and reuse the noisy phase for reconstruction, which introduces errors.
Time‑Domain Feature Extraction and Advantages
The encoder is a linear converter: a 1-D convolution with filter length 16, stride 8, and 512 channels, transforming each 16-sample frame into a 512-dimensional vector, so a 32000-sample input becomes a 3999×512 feature map.
The effective window length is under 4 ms at 8 kHz sampling (16 samples ≈ 2 ms), yielding low latency.
Using time‑domain input avoids phase issues.
Encoder filters learn to encode different frequencies; low‑frequency weights are higher, matching human voice characteristics.
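The encoder's framing arithmetic can be checked with a small NumPy sketch; the `encode` helper and the random basis matrix are stand-ins for the learned filters, not the trained model:

```python
import numpy as np

# Encoder settings stated in the text: filter length 16, stride 8, 512 channels.
FILTER_LEN, STRIDE, CHANNELS = 16, 8, 512

def encode(signal, basis):
    """Linear time-domain encoding: frame the signal with a hop of 8 samples
    and project each 16-sample frame onto 512 basis filters."""
    n_frames = (len(signal) - FILTER_LEN) // STRIDE + 1
    frames = np.stack([signal[i * STRIDE : i * STRIDE + FILTER_LEN]
                       for i in range(n_frames)])
    return frames @ basis.T            # shape: (n_frames, 512)

rng = np.random.default_rng(0)
basis = rng.standard_normal((CHANNELS, FILTER_LEN))  # stand-in for learned filters
features = encode(rng.standard_normal(32000), basis)
print(features.shape)  # (3999, 512)
```

The frame count (32000 − 16) / 8 + 1 = 3999 matches the 3999×512 feature map described above.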
Separation Network
Encoder outputs feed a separation network that produces two masks, one estimating the speech component and one the noise component.
Network details:
Architecture similar to WaveNet, composed of stacked dilated 1-D CNN layers. Example: the first layer sees three adjacent inputs (dilation 1), the second uses dilation 2, and the third dilation 4, rapidly expanding the receptive field.
Convolution types: non-causal (uses future frames, better denoising) and causal (uses only past and current frames, lower latency for real-time use).
Three repetitions of the CNN block allow the model to see up to 1.53 s of waveform; depthwise separable convolutions reduce parameters.
1×1 convolution and ReLU generate two masks, which multiply with encoder output and feed the decoder.
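The 1.53 s receptive field can be reproduced with simple arithmetic under Conv-TasNet-style assumptions (kernel size 3, dilations doubling over 8 layers per block, 3 repeated blocks); the text states only the final figure, so the exact layer counts here are an assumption:

```python
# Receptive-field arithmetic for the dilated 1-D conv stack described above.
KERNEL, N_LAYERS, N_REPEATS = 3, 8, 3          # assumed Conv-TasNet-style settings
STRIDE, FILTER_LEN, SAMPLE_RATE = 8, 16, 8000  # encoder settings from the text

rf_frames = 1
for _ in range(N_REPEATS):
    for layer in range(N_LAYERS):
        dilation = 2 ** layer
        # Each kernel-3 layer widens the receptive field by 2 * dilation frames.
        rf_frames += (KERNEL - 1) * dilation

# Convert the frame-level receptive field back to raw samples and seconds.
rf_samples = (rf_frames - 1) * STRIDE + FILTER_LEN
rf_seconds = rf_samples / SAMPLE_RATE
print(rf_frames, rf_samples, round(rf_seconds, 2))  # 1531 12256 1.53
```

Under these assumptions the model sees 12256 samples, i.e., about 1.53 s at 8 kHz, matching the figure quoted above.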
Decoding Module
Decoding linearly transforms the separation network output into a (3999, 16) matrix.
Decoder filters show frequency characteristics similar to the encoder's, emphasizing the low frequencies typical of the human voice.
Overlap‑add method reconstructs separated speech and noise signals of dimension (32000,1).
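The overlap-add reconstruction can be sketched as follows; `overlap_add` is an illustrative helper matching the stated segment length (16) and stride (8), not the article's implementation:

```python
import numpy as np

def overlap_add(segments, stride=8):
    """Reconstruct a waveform from overlapping decoded segments by summing
    each 16-sample segment into the output at its frame offset."""
    n_frames, seg_len = segments.shape
    out = np.zeros((n_frames - 1) * stride + seg_len)
    for i, seg in enumerate(segments):
        out[i * stride : i * stride + seg_len] += seg
    return out

decoded = np.ones((3999, 16))   # stand-in for the decoder's (3999, 16) output
signal = overlap_add(decoded)
print(signal.shape)  # (32000,)
```

With 3999 frames, stride 8, and segment length 16, the output length (3999 − 1) × 8 + 16 = 32000 matches the stated (32000, 1) dimension.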
Loss Function and Training Strategy
Loss uses scale‑invariant signal‑to‑interference ratio (SI‑SIR) and training employs permutation invariant training (PIT).
SI‑SIR definition is illustrated in the figure.
The PIT strategy evaluates both possible assignments of outputs to targets and selects the one with the lower loss, giving the network a more accurate training signal.
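A minimal NumPy sketch of the loss, assuming an SI-SNR-style definition of SI-SIR (projection of the estimate onto the target, compared against the residual); the function names and the toy swapped-output example are illustrative:

```python
import numpy as np

def si_sir(est, target):
    """Scale-invariant SIR in dB: project the zero-mean estimate onto the
    target, then compare the target-aligned part to the residual."""
    target_zm = target - target.mean()
    est_zm = est - est.mean()
    s_target = (est_zm @ target_zm) / (target_zm @ target_zm) * target_zm
    e_noise = est_zm - s_target
    return 10 * np.log10((s_target @ s_target) / (e_noise @ e_noise))

def pit_loss(est1, est2, ref1, ref2):
    """Two-output permutation invariant training: evaluate both
    output-to-target assignments and keep the better (lower) loss."""
    loss_a = -(si_sir(est1, ref1) + si_sir(est2, ref2)) / 2
    loss_b = -(si_sir(est1, ref2) + si_sir(est2, ref1)) / 2
    return min(loss_a, loss_b)

rng = np.random.default_rng(0)
speech, noise = rng.standard_normal(32000), rng.standard_normal(32000)
# Even if the outputs come back in swapped order, PIT picks the matching
# assignment, so the loss stays low (strongly negative in dB terms).
loss = pit_loss(noise + 0.01 * speech, speech + 0.01 * noise, speech, noise)
```

Negating the mean SI-SIR turns "higher SIR is better" into a minimizable loss.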
Model Performance
The final model achieves an SI‑SIR of 13 dB.
Douyu Streaming
Official account of Douyu Streaming Development Department, sharing audio and video technology best practices.
