Advances in Single‑Channel Speech Separation and Target Speaker Extraction with Iterative Refined Adaptation
The article surveys recent advances in single‑channel speech separation and target speaker extraction. It explains the encoder‑separator‑decoder framework, compares frequency‑domain and time‑domain methods, and highlights models such as SpEx+ and DPRNN‑Spe. It also introduces Iterative Refined Adaptation (IRA), which iteratively refines speaker embeddings to boost SI‑SDR, and shows how the same pipeline enables effective speaker suppression for applications like in‑vehicle voice interaction.
This article introduces the research status, principles, and recent progress of single‑channel speech separation and target speaker extraction, focusing on practical applications such as in‑vehicle voice interaction, customer‑service listening, and navigation‑sound suppression.
Problem definition: Speech separation aims to separate multiple speakers from a mixed audio signal, while target speaker extraction isolates the voice of a pre‑specified speaker using a reference embedding.
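In signal terms, the single observed channel is simply a sum of the source waveforms. A minimal numpy sketch (random noise standing in for real speech) makes the two task definitions concrete:

```python
import numpy as np

# Toy stand-ins for two speakers' waveforms (random noise, not real speech).
rng = np.random.default_rng(0)
s1 = rng.standard_normal(16000)   # speaker 1, 1 s at 16 kHz
s2 = rng.standard_normal(16000)   # speaker 2

mixture = s1 + s2  # the single observed channel: y = s1 + s2

# Speech separation must recover both s1 and s2 from `mixture`;
# target speaker extraction recovers only the source that matches a
# pre-enrolled reference embedding.
```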
Performance overview: A comparison of mainstream methods on the WSJ0‑2mix (clean) and WHAM! (noisy) datasets shows significant SI‑SDR improvement in recent years. Methods before PSM are frequency‑domain based; after PSM, most approaches (except deep CASA) are time‑domain.
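SI‑SDR, the metric behind these comparisons, measures residual distortion after optimally rescaling the reference, so it is insensitive to the estimate's overall gain. A straightforward numpy implementation (the toy signals are random stand‑ins):

```python
import numpy as np

def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to find the scaled target.
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    noise = estimate - target
    return 10 * np.log10(np.dot(target, target) / np.dot(noise, noise))

rng = np.random.default_rng(0)
s = rng.standard_normal(8000)
noisy = s + 0.1 * rng.standard_normal(8000)
print(si_sdr(noisy, s))  # ≈ 20 dB for 10%-amplitude added noise
```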
Encoder‑Separator‑Decoder framework: The encoder maps the mixture into a 2‑D latent representation (via an STFT or a 1‑D CNN), the separator predicts a mask for each source, and the decoder reconstructs the time‑domain signals. This framework unifies frequency‑ and time‑domain methods.
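The three stages can be sketched in a few lines of numpy. Random matrices stand in for the learned encoder/decoder bases, and a sigmoid of a random projection stands in for a trained separator (TCN, DPRNN, etc.); only the data flow is meaningful here:

```python
import numpy as np

rng = np.random.default_rng(0)
L, N = 40, 64                      # window length, number of basis filters
mixture = rng.standard_normal(4000)

# Encoder: non-overlapping frames projected onto N basis filters
# (equivalent to a stride-L 1-D convolution), followed by ReLU.
frames = mixture[: len(mixture) // L * L].reshape(-1, L)
encoder_basis = rng.standard_normal((L, N))
latent = np.maximum(frames @ encoder_basis, 0)       # shape (T, N)

# Separator: one mask per source in [0, 1]; a placeholder in place of a
# trained network. Two masks that sum to one partition the latent space.
mask1 = 1 / (1 + np.exp(-(latent @ rng.standard_normal((N, N)))))
mask2 = 1 - mask1

# Decoder: masked latent back to a waveform via a transposed basis.
decoder_basis = rng.standard_normal((N, L))
source1 = ((latent * mask1) @ decoder_basis).reshape(-1)
source2 = ((latent * mask2) @ decoder_basis).reshape(-1)
```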
Frequency‑domain methods benefit from interpretable masks but suffer from difficult phase reconstruction, longer latency, and limited robustness to noise.
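The phase difficulty arises because the mask is applied only to the STFT magnitude, and the mixture's own phase is reused at reconstruction. A one‑frame numpy sketch (random data, placeholder mask):

```python
import numpy as np

rng = np.random.default_rng(0)
frame = rng.standard_normal(512)   # one analysis frame of the mixture

spec = np.fft.rfft(frame)
magnitude, phase = np.abs(spec), np.angle(spec)

mask = rng.uniform(0, 1, size=magnitude.shape)  # placeholder magnitude mask
masked_mag = mask * magnitude

# Reconstruction borrows the mixture phase -- the source of the
# phase-estimation problem that time-domain models sidestep entirely.
estimate = np.fft.irfft(masked_mag * np.exp(1j * phase), n=512)
```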
Time‑domain methods (e.g., Conv‑TasNet, DPRNN‑TasNet) avoid phase issues and achieve low latency, but masks are less interpretable.
Target speaker extraction requires a speaker embedding (e.g., d‑vector) as reference. Early work such as Google’s VoiceFilter used a fixed embedding; later approaches employ an auxiliary network to jointly learn high‑quality embeddings.
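A reference embedding is typically matched against candidate embeddings by cosine similarity. A toy numpy sketch, with random vectors standing in for real d‑vectors:

```python
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
# Hypothetical 256-dim embeddings in place of real d-vectors.
enrolled = rng.standard_normal(256)          # from the target's reference clip
candidates = rng.standard_normal((2, 256))   # embeddings of two separated outputs
candidates[1] = enrolled + 0.1 * rng.standard_normal(256)  # near the target

target_idx = int(np.argmax([cosine(c, enrolled) for c in candidates]))
print(target_idx)  # -> 1, the candidate nearest the enrolled speaker
```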
Representative models: SpEx and SpEx+ (based on Conv‑TasNet) achieve strong performance; SpEx+ shares weights between the speech and speaker encoders. DPRNN‑Spe replaces the TCN in SpEx+ with a DPRNN for a more compact model.
Iterative Refined Adaptation (IRA): Inspired by auditory perception, IRA iteratively refines the speaker embedding by feeding the first extraction result back to the auxiliary network, producing a more accurate embedding for subsequent extraction passes. Experiments show consistent SI‑SDR gains on both the WSJ0‑2mix‑extr (clean) and WHAM! (noisy) test sets, especially for unseen speakers.
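The IRA loop itself is short. In this hypothetical sketch, `extractor` and `aux_net` are assumed names for the extraction network and the auxiliary embedding network; the toy stand‑ins at the bottom exist only to show the data flow, not any acoustic behavior:

```python
def ira_extract(mixture, reference_clip, extractor, aux_net, n_iters=2):
    """Iterative Refined Adaptation: re-embed from each pass's own output."""
    embedding = aux_net(reference_clip)           # initial embedding from the reference
    estimate = None
    for _ in range(n_iters):
        estimate = extractor(mixture, embedding)  # extract with the current embedding
        embedding = aux_net(estimate)             # refine from the cleaner estimate
    return estimate

# Toy stand-ins (data flow only): the "embedding" is a scalar mean and
# "extraction" just rescales the mixture by it.
aux_net = lambda sig: sum(sig) / len(sig)
extractor = lambda mix, emb: [x * emb for x in mix]
estimate = ira_extract([1.0, 2.0], [0.5, 0.5], extractor, aux_net)
```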
Speaker suppression applies the same extraction pipeline to remove unwanted speech (e.g., navigation audio). Compared with traditional acoustic echo cancellation, speaker suppression achieves higher removal quality without harming near‑end speech.
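In code, suppression is just enrollment on the unwanted source followed by subtraction. This numpy sketch idealizes the extractor as perfect (it returns the true navigation signal), which a real network only approximates:

```python
import numpy as np

rng = np.random.default_rng(0)
near_end = rng.standard_normal(8000)      # driver's speech (to be preserved)
navigation = rng.standard_normal(8000)    # unwanted prompt audio (enrolled)
mixture = near_end + navigation

# Stand-in for the extraction network, assumed here to recover the
# enrolled unwanted source exactly; real systems return an estimate.
extracted_unwanted = navigation

suppressed = mixture - extracted_unwanted  # near-end speech remains
```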
Experimental results: Tables and figures demonstrate that IRA improves model robustness in both clean and noisy conditions, and that speaker suppression effectively eliminates navigation sounds while preserving speech integrity.
The article concludes with ongoing research directions, including further robustness improvements for unregistered speakers and the impact of training negative samples on speaker suppression performance.
Didi Tech
Official Didi technology account