How to Boost Real-Time Audio Quality with Advanced AEC, AGC, and ANC Techniques
This article details a comprehensive redesign of acoustic echo cancellation (AEC), automatic gain control (AGC), and automatic noise control (ANC) for real-time communication. By combining WebRTC and Speex, the new design improves delay estimation, linear filtering, and non-linear processing, and demonstrates better performance than the original WebRTC-only solution.
0. Project Background
In real-time communication scenarios such as link-mic, audio must be pre-processed. The 3A technologies—Acoustic Echo Cancellation (AEC), Automatic Gain Control (AGC), and Automatic Noise Control (ANC)—are core to link-mic. The original solution, based entirely on the open-source WebRTC stack, suffered from several problems:
- For audio sampled above 16 kHz, processing falls back to a split-band scheme in which the high band is handled only in a simplified way.
- Echo leakage and audio drop-outs occur frequently.
- Configuration is complex and inconsistent across endpoints, and many audio modules are still experimental.
- Double-talk handling causes severe loss of near-end speech.
1. Improved Overall Scheme
To address the shortcomings of the pure WebRTC solution, this proposal redesigns the AEC module by combining WebRTC with Speex‑based reconstruction and tuning. The overall architecture is shown below.
The complete AEC solution consists of three modules:
- Delay estimation
- Linear filtering
- Non-linear residual-echo removal (post-processing)
2. Delay Estimation Module
Before AEC processing, the remote reference signal and the echo must be time‑aligned; otherwise echo cannot be removed. The delay estimation module is a core algorithm that ensures proper AEC operation.
Its workflow is illustrated in the following diagram.
After an FFT, the spectra of the far-end and near-end signals (far_spectrum, near_spectrum) are obtained, and each far-end spectrum is stored in a history buffer as a candidate match. The 32 most significant frequency bins (bins 12–43) are selected, and a threshold spectrum is computed; bins exceeding the threshold are set to 1 and the rest to 0, producing binary spectra. XOR-ing the near-end binary spectrum against each stored far-end binary spectrum yields a bit-difference count; the candidate with the highest similarity (lowest count) is chosen, and the corresponding delay is derived from its position in the history buffer.
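The matching step above can be sketched in a few lines. This is an illustrative toy (8 bins instead of WebRTC's 32, made-up threshold values), not the production delay estimator:

```python
# Hedged sketch of binary-spectrum delay estimation, loosely following the
# description above. Bin counts and thresholds here are illustrative.

def binary_spectrum(spectrum, threshold):
    """Pack a spectrum into bits: 1 where a bin exceeds its threshold."""
    bits = 0
    for i, (s, t) in enumerate(zip(spectrum, threshold)):
        if s > t:
            bits |= 1 << i
    return bits

def best_delay(far_history, near_binary):
    """Index of the stored far-end spectrum most similar to the near-end one.

    Similarity = fewest differing bits (lowest XOR popcount); the index in
    the history buffer corresponds to the candidate delay."""
    best_idx, best_dist = 0, None
    for idx, far_binary in enumerate(far_history):
        dist = bin(far_binary ^ near_binary).count("1")
        if best_dist is None or dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx

# Toy example: two stored far-end spectra; the near-end matches the second.
threshold = [1.0] * 8
far_history = [
    binary_spectrum([2, 0, 2, 0, 2, 0, 2, 0], threshold),  # candidate delay 0
    binary_spectrum([0, 2, 0, 2, 0, 2, 0, 2], threshold),  # candidate delay 1
]
near = binary_spectrum([0, 2, 0, 2, 0, 2, 0, 2], threshold)
assert best_delay(far_history, near) == 1
```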
3. Linear Filtering Module
The linear filter uses Speex’s MDF (Multidelay‑Block Frequency‑domain) algorithm, comprising three parts: linear filter structure, double‑talk control, and optimal step‑size control.
3.1 Linear Filter Structure
The MDF structure implements an FIR filter in the frequency domain using an overlap‑and‑save block‑processing approach.
The MDF algorithm includes:
- Block-wise processing of the input signal with overlap-and-save convolution.
- FFT-based frequency-domain convolution, reducing complexity from O(N²) to O(N log₂ N).
- Segmentation of the FIR coefficients into multiple sub-filters, shortening block lengths and reducing filter latency.
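The key property behind the sub-filter segmentation is that splitting a long FIR into short partitions and summing their delayed outputs reproduces the full convolution exactly. The sketch below shows this with plain time-domain convolution; the real MDF performs each partition in the frequency domain via FFT and overlap-save:

```python
# Sketch of partitioned (multidelay) convolution: splitting the FIR h into
# sub-filters of length `block` and summing the delayed partial results
# equals one long convolution. The partition offsets become the "delays".

def convolve(x, h):
    """Plain full linear convolution, length len(x) + len(h) - 1."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def partitioned_convolve(x, h, block):
    """Convolve x with h via sub-filters h[k:k+block], each delayed by k."""
    y = [0.0] * (len(x) + len(h) - 1)
    for k in range(0, len(h), block):
        part = convolve(x, h[k:k + block])
        for i, v in enumerate(part):
            y[i + k] += v  # sub-filter output delayed by its offset k
    return y

x = [1.0, 2.0, -1.0, 0.5, 3.0]
h = [0.5, -0.25, 0.1, 0.3]
full = convolve(x, h)
parts = partitioned_convolve(x, h, block=2)
assert all(abs(a - b) < 1e-12 for a, b in zip(full, parts))
```

By linearity of convolution the two results agree term by term, which is why MDF can shorten its FFT blocks without changing the filter it implements.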
3.2 Double‑Talk Control
During double‑talk, the filter must maintain tracking performance. Speex MDF employs a dual‑filter architecture: an adaptive Background Filter and a non‑adaptive Foreground Filter. When the adaptive filter diverges, the system falls back to the foreground result and resets the background filter; when the background filter recovers, its parameters are copied to the foreground. This implicit double‑talk detection is illustrated below.
The decision is based on the relative powers of the two filters: Sff, the power of the foreground filter's error; See, the power of the background filter's error; and Dbf, the squared difference of the two filter outputs. When the background filter clearly reduces the error relative to the foreground and the output difference Dbf accounts for that improvement, its coefficients are copied to the foreground; if the background filter diverges excessively, it is reset to the foreground filter.
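The control flow of this implicit double-talk detection can be sketched as follows. Note the threshold `var_update` and the exact comparisons are simplifications for illustration, not the tuned constants and smoothed statistics used in Speex's mdf.c:

```python
# Hedged sketch of the dual-filter update decision. The real Speex logic
# uses several tuned constants and recursively smoothed powers; this
# simplified version only illustrates the three possible outcomes.

def dual_filter_decision(Sff, See, Dbf, var_update=0.5):
    """Decide the filter update for one frame.

    Sff: foreground-filter error power, See: background-filter error power,
    Dbf: power of the difference of the two filter outputs.
    var_update is an illustrative threshold, not the Speex constant."""
    if See < Sff and Dbf > var_update * (Sff - See):
        # Background clearly better and the difference explains the gain.
        return "copy_background_to_foreground"
    if See > Sff:
        # Background diverged (e.g. during double-talk): fall back.
        return "reset_background_to_foreground"
    return "keep_both"

assert dual_filter_decision(Sff=10.0, See=2.0, Dbf=5.0) == "copy_background_to_foreground"
assert dual_filter_decision(Sff=2.0, See=10.0, Dbf=5.0) == "reset_background_to_foreground"
```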
3.3 Optimal Step‑Size Control
The MDF uses a variable step size derived from the ratio of residual-echo power to error-signal power: the optimal step size in each frequency bin is μ_opt(f) = σ̂r²(f) / σ̂e²(f), where σ̂r² is the estimated residual-echo power and σ̂e² the error power.
Since the residual echo cannot be observed directly, a leakage factor η (0 ≤ η ≤ 1) relates it to the filter output power, σ̂r²(f) ≈ η · σ̂Y²(f); η itself is estimated by recursive averaging of the cross- and auto-power spectra of the error and the filter output.
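A per-bin sketch of this step-size rule, assuming simple first-order smoothing (the smoothing constant `beta`, the clamp `mu_max`, and folding the spectral smoothing into one update are assumptions for brevity):

```python
# Hedged sketch of leakage-based variable step size: the step size is the
# ratio of estimated residual-echo power (leakage * filter-output power)
# to error power, clamped for stability.

def update_step_size(eta, y_pow, e_pow, mu_max=0.5):
    """mu_opt(f) ~= eta * |Y(f)|^2 / |E(f)|^2, clamped to (0, mu_max]."""
    residual = eta * y_pow                 # estimated residual-echo power
    mu = residual / max(e_pow, 1e-12)      # optimal step size for this bin
    return min(mu, mu_max)

def update_leakage(eta, Pey, Pyy, beta=0.05):
    """Recursive averaging of the leakage estimate eta ~ <P_EY> / <P_YY>."""
    target = Pey / max(Pyy, 1e-12)
    return (1 - beta) * eta + beta * max(0.0, min(1.0, target))

# Strong leakage estimate and large filter output -> larger step size.
mu = update_step_size(eta=0.1, y_pow=4.0, e_pow=2.0)
assert abs(mu - 0.2) < 1e-12
```

During double-talk the near-end speech inflates the error power while the residual-echo estimate stays small, so μ_opt shrinks automatically and the filter stops adapting on corrupted frames.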
4. Non‑Linear Processing
Linear AEC cannot remove all echo due to adaptive filter limitations, poor speaker quality, and acoustic design, leaving nonlinear harmonic distortion. The Non‑Linear Processing (NLP) block eliminates this residual echo. NLP consists of spectral correlation calculation and spectral gain computation.
Correlation between far‑end and near‑end signals is used to estimate residual echo magnitude.
Key variables include dfw(n,f) (the near-end signal spectrum), xfw(n,f) (the far-end/echo signal spectrum), efw(n,f) (the linear-filter error spectrum), and γ (a smoothing factor, default 0.9). Derived metrics such as hNlXdAvg, hNlXdAvgWB, hNlDeAvg, and hNlXeAvg capture smoothed coherence between these signals, while near_level tracks near-end speech activity.
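The gain logic can be sketched from these quantities. This is a simplified illustration in the spirit of the coherence-based suppression described above; the exponent, the `min` combination, and the function names are assumptions, not the exact WebRTC tuning:

```python
# Hedged sketch of coherence-based residual-echo suppression: per-bin gains
# derived from near-end/error coherence (high -> little residual echo) and
# near-end/far-end coherence (high -> strong echo).

def smooth(prev, new, gamma=0.9):
    """First-order recursive smoothing used for all spectral statistics."""
    return gamma * prev + (1 - gamma) * new

def suppression_gain(coh_de, coh_xd, exponent=2.0):
    """Per-bin spectral gain in [0, 1].

    coh_de: coherence between near-end and error signal (speech survives).
    coh_xd: coherence between near-end and far-end signal (echo present)."""
    h_nl = min(coh_de, 1.0 - coh_xd)   # combine the two coherence cues
    h_nl = max(0.0, min(1.0, h_nl))
    return h_nl ** exponent            # exponent sharpens the suppression

# Mostly echo: near end tracks the far end, not the error signal.
assert suppression_gain(coh_de=0.1, coh_xd=0.9) < 0.05
# Mostly near-end speech: the error signal retains the near-end signal.
assert suppression_gain(coh_de=0.95, coh_xd=0.1) > 0.8
```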
Spectral gain computation is shown below.
5. Performance Demonstration
In pure‑echo scenarios, the improved AEC (dy_audio) outperforms the original WebRTC implementation, producing cleaner echo cancellation.
6. Conclusion and Outlook
The proposed scheme achieves better echo suppression in pure‑echo conditions and preserves near‑end speech during double‑talk, reducing audio drop‑out. However, performance degrades when echo energy is high, and music echo is less effectively removed than speech. Further research is needed to address these limitations.
Douyu Streaming
Official account of Douyu Streaming Development Department, sharing audio and video technology best practices.