Speaker-Aware Module for Single-Sample Voice Conversion (SAVC)

The paper presents a speaker‑aware module (SAM) that enables high‑quality voice conversion using only a single utterance of the target speaker, addressing the small‑data challenge in speech timbre transfer and achieving state‑of‑the‑art performance on the Aishell‑1 benchmark.


Voice conversion (VC) transfers a speaker's timbre while preserving linguistic content, and is crucial for applications such as movie dubbing, impersonation, and voice cloning.

Although deep‑learning‑based VC has progressed rapidly, most methods require large amounts of target‑speaker data, making single‑sample VC a pressing research problem.

Researchers from Kuaishou MMU propose a Speaker‑Aware Module (SAM) that extracts a target speaker's timbre representation from just one utterance, enabling single‑sample VC; the work was accepted at ICASSP 2021 and a Chinese invention patent has been filed.

Paper link: https://ieeexplore.ieee.org/document/9414081

Core idea of single‑sample VC: first extract content features from the speech, then decouple a target‑speaker feature vector from the single sample and fuse it with the content features to generate speech in the target timbre.

The SAVC system consists of:

A pre‑trained speaker‑independent ASR model (SI‑ASR) that outputs phonetic posteriorgrams (PPGs) representing content.

The Speaker‑Aware Module (SAM) that extracts a speaker embedding while suppressing content interference.

A decoder that combines PPGs and the speaker embedding to predict speaker‑dependent acoustic features.

An LPCNet vocoder that reconstructs the waveform from the predicted acoustic features.
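The four-stage pipeline above can be sketched as a data-flow skeleton. This is a minimal illustration with stand-in components and hypothetical dimensions (the paper's exact feature sizes and model internals may differ); only the shapes and the hand-off between stages are the point:

```python
import numpy as np

# Hypothetical dimensions for illustration only -- not the paper's exact values.
T, N_PHONES, D_SPK, D_MEL = 120, 218, 200, 80

def si_asr_ppg(frames):
    """Stand-in for the pre-trained SI-ASR: content as phonetic posteriorgrams."""
    logits = frames @ np.random.default_rng(0).standard_normal((frames.shape[1], N_PHONES))
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)          # (T, N_PHONES), each row sums to 1

def sam_embedding(ref_frames):
    """Stand-in for SAM: one target utterance -> fixed-length speaker vector."""
    return ref_frames.mean(axis=0)[:D_SPK]           # (D_SPK,)

def decoder(ppg, spk):
    """Stand-in decoder: fuse content + timbre -> acoustic features for the vocoder."""
    fused = np.concatenate([ppg, np.tile(spk, (ppg.shape[0], 1))], axis=1)
    rng = np.random.default_rng(1)
    return fused @ rng.standard_normal((fused.shape[1], D_MEL))  # (T, D_MEL)

source = np.random.default_rng(2).standard_normal((T, 40))       # source utterance features
reference = np.random.default_rng(3).standard_normal((60, 256))  # single target utterance

acoustic = decoder(si_asr_ppg(source), sam_embedding(reference))
print(acoustic.shape)  # (120, 80) -- handed to the LPCNet vocoder for waveform synthesis
```

In the real system each stand-in is a trained neural network; the sketch only shows that content (PPGs) and timbre (the SAM embedding) are extracted separately and fused in the decoder.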

SAM is inspired by speaker verification and attention mechanisms and contains three sub‑modules:

Reference Encoder: compresses variable‑length target‑speaker utterances into a fixed‑length vector R = RefEncoder(X), making the representation robust to temporal variations.

Speaker Knowledge Base (SKB): builds a matrix S = [S₁, S₂, …, S_N] of speaker vectors (e.g., 200‑dimensional x‑vectors) selected to be gender‑balanced; in the paper N = 200.

Multi‑Head Attention Layer: computes global similarity between the reference speaker vector and the SKB, using the standard attention formulation with queries Q, keys K, values V and dimension d_k.

The multi‑head attention computation follows the standard formulation:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

MultiHead(Q, K, V) = Concat(head₁, …, head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

Four attention heads are used (h = 4), with learnable projection matrices W_i^Q, W_i^K, W_i^V, and W^O.
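A minimal numpy sketch of this attention step, assuming the reference vector R serves as the query and the SKB matrix S supplies keys and values (the paper's exact projection setup may differ; dimensions here follow the stated D = 200, N = 200, h = 4):

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, H = 200, 200, 4          # embedding dim, SKB size, attention heads
d_k = D // H                   # per-head dimension

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(R, S, W_q, W_k, W_v, W_o):
    """Query: reference vector R (1, D); keys/values: SKB matrix S (N, D)."""
    heads = []
    for i in range(H):
        Q, K, V = R @ W_q[i], S @ W_k[i], S @ W_v[i]   # project into head i
        scores = softmax(Q @ K.T / np.sqrt(d_k))       # (1, N): similarity to each SKB speaker
        heads.append(scores @ V)                       # (1, d_k): weighted SKB combination
    return np.concatenate(heads, axis=1) @ W_o         # (1, D): fused speaker embedding

R = rng.standard_normal((1, D))                  # output of the Reference Encoder
S = rng.standard_normal((N, D))                  # gender-balanced speaker knowledge base
W_q = rng.standard_normal((H, D, d_k)); W_k = rng.standard_normal((H, D, d_k))
W_v = rng.standard_normal((H, D, d_k)); W_o = rng.standard_normal((H * d_k, D))

emb = multi_head_attention(R, S, W_q, W_k, W_v, W_o)
print(emb.shape)  # (1, 200)
```

The attention weights express the target speaker as a soft combination of known speakers in the SKB, which is what lets a single unseen utterance yield a usable timbre representation.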

The final target‑speaker embedding is obtained by fusing the attention output (see figure below).

Network parameters of SAM are listed in the accompanying table (image).

Experimental Comparison

The authors evaluate SAVC against several state‑of‑the‑art single‑sample VC models using the Aishell‑1 Chinese dataset (340 speakers for training, out‑of‑set speakers for testing). Both subjective and objective metrics show SAVC outperforming baselines.

Mel‑Cepstral Distortion (MCD) results demonstrate that SAVC‑GL achieves significantly lower distortion than INVC‑GL, and that improving the vocoder further reduces distortion.
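For reference, mel‑cepstral distortion is a standard objective metric; a common way to compute it (not necessarily the paper's exact configuration) skips the 0th energy coefficient and averages a per‑frame distance over time‑aligned cepstral sequences:

```python
import numpy as np

def mcd(c_ref, c_conv):
    """Mel-cepstral distortion in dB between aligned cepstral sequences of shape (T, D).
    Skips the 0th (energy) coefficient; frames are assumed pre-aligned
    (in practice DTW alignment is applied first)."""
    diff = c_ref[:, 1:] - c_conv[:, 1:]
    per_frame = (10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff**2, axis=1))
    return per_frame.mean()

rng = np.random.default_rng(0)
ref = rng.standard_normal((100, 25))          # 100 frames, 25 cepstral coefficients
print(mcd(ref, ref))                          # 0.0 for identical sequences
print(round(mcd(ref, ref + 0.1), 2))          # 3.01 -- a uniform 0.1 offset on 24 coefficients
```

Lower MCD means the converted spectrum is closer to the target speaker's, which is why the vocoder improvement noted above also shows up in this metric.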

Similarity scores (Figure 3) indicate SAVC attains the highest scores across all conversion pairs, with no noticeable gender bias thanks to the balanced SKB design.

Mean Opinion Score (MOS) for naturalness shows LPCNet‑based reconstruction outperforms Griffin‑Lim, especially under noisy conditions; SAVC‑GL and GST‑VC achieve comparable MOS, while MSVC lags due to mismatched speaker embeddings.

Demo videos (link: https://vcdemo-1.github.io/SAVC/savc.html) illustrate male‑to‑female and female‑to‑male conversions using a single reference utterance.

Applications

Voice‑changing technology is widely used at Kuaishou for short‑video editing, live‑stream voice alteration, and personalized timbre customization. Single‑sample VC dramatically reduces data‑collection costs and computational load, representing a major step forward for voice interaction services.

Kuaishou MMU

The Multimedia Understanding (MMU) team at Kuaishou handles large‑scale audio‑video content understanding, providing over 500 intelligent services across search, recommendation, ecosystem analysis, and risk control, and continuously recruits top talent in related fields.

Tags: deep learning, speech synthesis, voice conversion, LPCNet, single-sample, speaker-aware module
Written by

Kuaishou Tech

Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.
