
How a Low‑Latency Hierarchical Fusion Network Beats Echoes in Real‑Time Calls

At ICASSP 2023, Kuaishou’s audio team presented a low‑latency hierarchical fusion network for full‑band acoustic echo cancellation, detailing its multi‑stage design, asymmetric windowing, loss functions, and training strategy. The system took second place in the non‑personalized track of the AEC Challenge and has since been deployed in production.

Kuaishou Audio & Video Technology

Background

Echo cancellation is a critical and complex problem in real‑time communications such as video conferencing, live streaming, and voice chat. Improper echo handling leads to users hearing their own voice or severe distortion, dramatically degrading the call experience.

Algorithm Design

Kuaishou’s audio‑video team designed a low‑latency hierarchical fusion network for full‑band acoustic echo cancellation (AEC). The framework consists of four stages: preprocessing, wide‑band prediction, high‑band prediction, and signal reconstruction.

Low System Latency

To achieve low system latency while preserving high‑frequency resolution, an asymmetric window pair is used in the time‑frequency conversion: a longer analysis window (1536 points) and a shorter synthesis window (960 points) at 48 kHz sampling rate, yielding a 20 ms overall delay.
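The latency arithmetic above can be sketched directly. In an asymmetric-window analysis/synthesis system, the algorithmic delay is governed by the shorter synthesis side rather than the long analysis window; the window lengths and sampling rate below are taken from the article, while the print-out is just the delay computation:

```python
# Algorithmic-delay sketch for the asymmetric window pair described above.
SAMPLE_RATE = 48_000   # Hz
ANALYSIS_LEN = 1536    # long analysis window -> fine frequency resolution
SYNTHESIS_LEN = 960    # short synthesis window -> low algorithmic delay

# With an asymmetric pair, the delay is set by the synthesis window,
# not by the longer analysis window.
delay_ms = 1000 * SYNTHESIS_LEN / SAMPLE_RATE
print(f"algorithmic delay: {delay_ms:.0f} ms")  # 20 ms
```

This is why the system can keep a 1536-point frequency resolution while still meeting a 20 ms delay budget.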

Network Internal Information Fusion

Within the wide‑band network, information is fused through a CrossNet architecture. The TCN‑CrossNet has two stages: four stacked convolutional layers for feature extraction, followed by two parallel branches that predict speech and interference masks, each built from three stacked TCNs. Cross connections between the branches allow them to exchange information internally.

To further improve performance, Convolutional Gated Linear Units (ConvGLU) replace standard convolutions, and a Dual Temporal Convolutional Module (DTCM) replaces the traditional TCN, using two parallel dilated convolutions with complementary dilation rates to capture long‑term and local dependencies.
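A minimal numpy sketch of these two building blocks may help: a ConvGLU gates one convolution path with the sigmoid of another, and a DTCM-style module runs two dilated convolutions in parallel, one with a large dilation for long-term context and one with a small dilation for local detail. Kernel sizes, dilation rates, and the random weights are illustrative assumptions, not the paper's hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def dilated_conv1d(x, w, dilation):
    """Causal dilated 1-D convolution along time (single channel)."""
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.pad(x, (pad, 0))  # left-pad so the conv stays causal
    return np.array([sum(w[i] * xp[t + pad - i * dilation] for i in range(k))
                     for t in range(len(x))])

def conv_glu(x, w_feat, w_gate):
    """ConvGLU: a feature path modulated by a sigmoid gate path."""
    gate = 1.0 / (1.0 + np.exp(-dilated_conv1d(x, w_gate, 1)))
    return dilated_conv1d(x, w_feat, 1) * gate

def dtcm(x, w_long, w_local):
    """DTCM-style pair: large dilation captures long-term dependencies,
    small dilation captures local ones; the outputs are summed."""
    return dilated_conv1d(x, w_long, dilation=4) + dilated_conv1d(x, w_local, dilation=1)

x = rng.standard_normal(64)
y = dtcm(conv_glu(x, rng.standard_normal(3), rng.standard_normal(3)),
         rng.standard_normal(3), rng.standard_normal(3))
print(y.shape)  # (64,)
```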

Network Inter‑Information Fusion

A fusion module combines the strengths of GRU‑CrossNet and TCN‑CrossNet. It consists of three 2‑D convolutional layers, a GRU layer, and a softmax‑activated fully connected layer, outputting two weight matrices that modulate the predictions of each CrossNet. Temporal smoothing via a moving average prevents abrupt changes when switching between systems.
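The fusion step above can be sketched as follows: per-frame softmax weights blend the two CrossNet predictions, and the weights are smoothed over time so the blend never switches abruptly. The article says "moving average"; this sketch uses an exponential moving average, and the smoothing factor and toy shapes are assumptions:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse(pred_gru, pred_tcn, logits, alpha=0.8):
    """Blend two network predictions frame by frame with smoothed weights."""
    w = softmax(logits, axis=-1)      # (frames, 2), sums to 1 per frame
    smoothed = np.empty_like(w)
    smoothed[0] = w[0]
    for t in range(1, len(w)):        # temporal smoothing of the weights
        smoothed[t] = alpha * smoothed[t - 1] + (1 - alpha) * w[t]
    return smoothed[:, 0:1] * pred_gru + smoothed[:, 1:2] * pred_tcn

frames, bins = 10, 4
rng = np.random.default_rng(1)
out = fuse(rng.standard_normal((frames, bins)),
           rng.standard_normal((frames, bins)),
           rng.standard_normal((frames, 2)))
print(out.shape)  # (10, 4)
```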

Full‑Band Echo Cancellation

The wide‑band and high‑band networks predict their respective spectra, which are then combined for full‑band echo cancellation. The wide‑band prediction guides the high‑band network because speech energy is concentrated in the lower frequencies, and the suppression pattern often extends to higher frequencies.
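At reconstruction time, the two predicted spectra are stitched back together along the frequency axis. A toy sketch, in which the split of the full-band bins between the two networks is an illustrative assumption (513 + 256 = 769 bins, matching a 1536-point analysis window):

```python
import numpy as np

WIDE_BINS, HIGH_BINS = 513, 256   # illustrative split of the full-band bins

def combine_fullband(wide_spec, high_spec):
    """Concatenate wide-band and high-band predictions along frequency
    to form the full-band spectrum."""
    assert wide_spec.shape[-1] == WIDE_BINS
    assert high_spec.shape[-1] == HIGH_BINS
    return np.concatenate([wide_spec, high_spec], axis=-1)

frames = 5
full = combine_fullband(np.ones((frames, WIDE_BINS)),
                        np.ones((frames, HIGH_BINS)))
print(full.shape)  # (5, 769)
```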

Loss Functions

The wide‑band network optimizes a combination of signal loss (OSISNR, MC‑MSE, cComp) and an ASR loss to improve both perceptual quality and speech‑recognition accuracy. The high‑band network uses OSISNR and MC‑MSE, as high‑frequency information has less impact on perceived quality.
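As one concrete example of the signal losses above, here is the standard scale-invariant SNR formulation that OSISNR-style losses build on; the paper's exact variant and loss weighting are not reproduced:

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB between an estimate and a reference."""
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference; the residual is "noise".
    target = np.dot(est, ref) / (np.dot(ref, ref) + eps) * ref
    noise = est - target
    return 10 * np.log10(np.dot(target, target) / (np.dot(noise, noise) + eps))

t = np.linspace(0, 1, 16_000)
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.3 * np.random.default_rng(2).standard_normal(t.size)
# A rescaled copy scores very high (scale invariance); a noisy copy scores lower.
print(si_snr(0.5 * clean, clean) > si_snr(noisy, clean))  # True
```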

Training Data and Method

Training data were sourced from the ICASSP 2022 AEC Challenge and DNS Challenge, with 16 kHz data derived by down‑sampling 48 kHz recordings. Data augmentation includes random SNR/SER settings, room impulse responses, equalization, band‑pass filtering, and random delays. A multi‑stage training regime first trains GRU‑CrossNet and TCN‑CrossNet separately, then fixes their parameters to train the fusion module, and finally fine‑tunes the high‑band network on top of the pretrained wide‑band model.
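One of the augmentation steps listed above, mixing at a randomly drawn SNR, can be sketched like this; the SNR range and the synthetic signals are illustrative assumptions:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db, eps=1e-8):
    """Scale `noise` so the speech-to-noise ratio equals `snr_db`, then mix."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + eps
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(3)
speech = rng.standard_normal(48_000)
noise = rng.standard_normal(48_000)
snr_db = rng.uniform(-5, 20)          # random SNR drawn per training example
mixed = mix_at_snr(speech, noise, snr_db)
print(mixed.shape)  # (48000,)
```

The same pattern applies to SER settings by substituting the far-end echo signal for the noise.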

Experimental Evaluation

In the non‑personalized track of the ICASSP 2023 AEC Challenge, the proposed solution outperformed the baseline in both subjective quality scores and word accuracy, securing second place.

Conclusion

The low‑latency hierarchical fusion network designed by Kuaishou’s audio team delivers effective echo cancellation, noise reduction, and high speech quality while meeting strict latency and computational constraints. Deployed in Kuaishou’s live streaming, PK, and chatroom services, it markedly reduces echo and clipping rates.

