How DeepXi and MHANet Revolutionize Speech Enhancement with Multi‑Head Attention
DeepXi is a two‑stage deep learning framework for speech enhancement that estimates the prior SNR and applies an MMSE gain function. Its MHANet extension uses multi‑head attention to model long‑range dependencies. This post covers the training strategy, compression to a GRU‑based model, deployment via TFLite, and the resulting low‑latency performance.
Background
Speech enhancement algorithms aim to improve perceived quality and intelligibility of noisy speech by suppressing background noise without distorting the speech.
Currently, deep learning methods are at the forefront, with deep neural networks (DNNs) used to map noisy speech magnitude spectra to clean spectra or noisy time‑domain frames to clean frames.
DeepXi Framework
DeepXi is a deep‑learning method for prior SNR estimation.
DeepXi consists of two stages:
Stage 1: Input noisy speech magnitude spectrum; a DNN estimates a mapped prior SNR, scaled to the [0,1] interval to accelerate SGD convergence.
Stage 2: The mapped prior SNR is used to compute an MMSE‑approximate gain function, which multiplies the noisy magnitude spectrum to obtain an estimate of the clean magnitude spectrum.
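The two stages can be sketched as follows. Note the specifics are illustrative assumptions: the Gaussian CDF's mu/sigma stand in for the per‑frequency training‑set statistics DeepXi actually estimates, and the square‑root Wiener gain is one MMSE‑style choice, not necessarily the exact gain function used.

```python
from statistics import NormalDist

# mu/sigma are placeholders for DeepXi's per-frequency training statistics
PRIOR = NormalDist(mu=0.0, sigma=10.0)

def map_prior_snr(xi_db):
    """Stage 1 target: squash the prior SNR (dB) into [0, 1] via a CDF."""
    return PRIOR.cdf(xi_db)

def gain_from_mapped_snr(xi_bar):
    """Stage 2: invert the mapping and form a gain (square-root Wiener
    shown here as one MMSE-style choice, not necessarily DeepXi's)."""
    xi_bar = min(max(xi_bar, 1e-7), 1.0 - 1e-7)   # avoid +/- infinity
    xi = 10.0 ** (PRIOR.inv_cdf(xi_bar) / 10.0)   # dB -> linear prior SNR
    return (xi / (1.0 + xi)) ** 0.5

# The enhanced magnitude is then gain * |X|, bin by bin.
```

The [0, 1] mapping is what lets a sigmoid output layer regress the target directly, which is the convergence benefit mentioned above.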
MHANet
Within the DeepXi framework, the DNN can be any architecture such as RNN or TCN.
Multi‑head attention (MHA) outperforms RNNs and TCNs on tasks such as machine translation by modeling long‑range dependencies through sequence similarity. DeepXi‑MHANet incorporates MHA into the DeepXi framework to model the long‑term dependencies of noisy speech effectively.
MHANet details:
The noisy speech magnitude spectrum |X| is summed with a positional encoding; the first layer projects it to dimension d_model, and B stacked blocks then output the mapped prior SNR. Each block contains an MHA module, a two‑layer feed‑forward network, residual connections, and frame‑wise normalization.
MHA module: queries Q, keys K, values V; output is weighted sum of V with attention weights computed from Q‑K similarity. Each head uses masked scaled dot‑product attention; dimensions satisfy d_k = d_v = d_model / H.
Masked scaled dot‑product attention computes similarity via scaled dot product, optionally applies a mask, then softmax, and multiplies by V_h to produce attention‑enhanced values.
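The masked scaled dot‑product attention and head‑splitting described above can be sketched in NumPy; the weight shapes and the causal (upper‑triangular) mask are assumptions for illustration:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """One head. Q, K: (T, d_k); V: (T, d_v); mask: (T, T), True = blocked."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # Q-K similarity, scaled
    if mask is not None:
        scores = np.where(mask, -1e9, scores)  # mask out future frames
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                         # weighted sum of values

def multi_head_attention(X, W_q, W_k, W_v, W_o, H, causal=True):
    """X: (T, d_model); each W: (d_model, d_model); d_k = d_v = d_model // H."""
    T, d_model = X.shape
    d_k = d_model // H
    mask = np.triu(np.ones((T, T), dtype=bool), k=1) if causal else None
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for h in range(H):
        s = slice(h * d_k, (h + 1) * d_k)      # per-head slice of projections
        heads.append(scaled_dot_product_attention(Q[:, s], K[:, s], V[:, s], mask))
    return np.concatenate(heads, axis=-1) @ W_o
```

With the causal mask in place, frame t attends only to frames ≤ t, which is what makes the network usable frame‑by‑frame at inference time.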
Training Strategy
Cross‑entropy loss.
Mini‑batch size = 10, 200 training iterations.
Each mini‑batch mixes clean speech with randomly selected noise at random start points and SNRs ranging from –10 to 20 dB in 1 dB steps.
Clean speech selection is random each iteration.
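The per‑iteration mixing step might look like the following; `mix_at_snr` is a hypothetical helper, and the signals are plain NumPy arrays:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db, rng):
    """Mix clean speech with a random segment of `noise` at `snr_db` dB."""
    start = rng.integers(0, len(noise) - len(clean) + 1)  # random start point
    seg = noise[start:start + len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(seg ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * seg

# Per-iteration draw: SNR uniform over -10..20 dB in 1 dB steps.
rng = np.random.default_rng(0)
snr_db = int(rng.integers(-10, 21))
```

Scaling the noise rather than the speech keeps the clean target unchanged across SNR draws.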
Optimizer: Adam with β₁=0.9, β₂=0.98, ε=10⁻⁹.
Learning rate α follows a warm‑up schedule: linearly increases until warm‑up steps ψ, then decays proportionally to the inverse square root of training steps.
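This is the warm‑up schedule from "Attention Is All You Need"; a sketch with illustrative d_model and warm‑up values (not necessarily those used here):

```python
def transformer_lr(step, d_model=256, warmup=40000):
    """Linear ramp for `warmup` steps, then decay proportional to 1/sqrt(step).

    lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)
    d_model and warmup values are illustrative placeholders.
    """
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

The two branches meet exactly at `step == warmup`, so the schedule peaks there and decays smoothly afterwards.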
Improvement Strategies
Because the Transformer model is large and unsuitable for edge deployment, a GRU‑based architecture was adopted.
Model Optimization
Replace Transformer with GRU, reducing parameters from 4.8 M to 0.3 M.
Limit model size to under 1 MB.
Apply data augmentation such as reverberation.
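A quick way to see that a GRU fits a ~0.3 M parameter budget is to count weights directly; the layer sizes below are illustrative, not the production model's:

```python
def gru_param_count(input_dim, units, reset_after=True):
    """Parameters in one GRU layer (Keras convention: reset_after=True
    carries two bias vectors per gate set)."""
    bias = 2 * units if reset_after else units
    return 3 * (input_dim * units + units * units + bias)

# Illustrative: one GRU layer on 257-bin spectra plus a dense output projection
total = gru_param_count(input_dim=257, units=180) + 180 * 257
```

A single modest GRU layer plus output projection lands well under 300 k parameters, versus the 4.8 M of the attention model.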
Deployment Solutions
Address data continuity issues caused by segment‑wise processing.
Adopt TFLite as the deployment framework.
Efficiently implement algorithmic operators like inverse error function and integration.
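For the inverse error function specifically, one lightweight on‑device option (not necessarily the one used here) is Winitzki's closed‑form approximation, which needs only log and sqrt:

```python
import math

def erfinv_approx(x):
    """Winitzki-style closed-form approximation to erf^-1 for |x| < 1.

    Accurate to a few 1e-3, which avoids pulling a heavyweight numerics
    library into the deployed binary.
    """
    a = 0.147
    ln1mx2 = math.log(1.0 - x * x)
    t = 2.0 / (math.pi * a) + ln1mx2 / 2.0
    return math.copysign(math.sqrt(math.sqrt(t * t - ln1mx2 / a) - t), x)
```

A lookup table with interpolation is another common trade‑off when even this much transcendental math is too slow per frame.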
Results
The full deployment pipeline produces an algorithm library meeting client requirements.
Parameters: 300 k
Memory: ≤10 MB
Latency: 16 ms
Library size: ≤2 MB
Effect Demonstration
The algorithm combines traditional methods with deep learning to denoise while preserving speech quality. Demo clips show denoising of keyboard noise and of wind/road noise.
Douyu Streaming
Official account of Douyu Streaming Development Department, sharing audio and video technology best practices.