
Voice Conversion (VC): Fundamentals, Progress, and Applications

Voice conversion (VC) changes a speaker's timbre and style while keeping the spoken text unchanged. It supports one‑to‑one, many‑to‑one, and many‑to‑many scenarios, with applications in medical assistance and entertainment. VC systems work from parallel or non‑parallel data, using methods such as DTW‑aligned frame mapping, attention‑based neural networks, PPG–LSTM pipelines, VAEs, normalizing‑flow models, and GANs. iQIYI's own work focuses on non‑parallel data, prosody preservation, and noise‑robust augmentation.

iQIYI Technical Product Team

Voice Conversion (VC) is a technology that transforms a speaker's voice into another timbre while keeping the linguistic content unchanged. It enables a wide range of entertaining and practical experiences.

In a recent iQIYI technical salon titled "Speech and Language Technology in Natural Interaction," senior R&D engineer Daniel Chen presented an overview of VC, covering its basics, recent advances, and future directions.

VC Goals: The technique aims to change the non‑linguistic information (speaker identity, speaking style, rhythm, etc.) while preserving the linguistic information (the spoken text). The two primary objectives are (1) converting the input audio's timbre to that of a target speaker, and (2) adapting the speaking style to the target speaker's manner.

Typical Applications:

• Medical assistance – helping patients who have lost vocal organs (e.g., after laryngectomy) to speak more clearly.
• Entertainment – providing humorous or stylized voices for short videos and other user‑generated content.

Conversion Scenarios:

• One‑to‑one: convert a single source speaker to a single target speaker.
• Many‑to‑one: convert many source speakers to a specific target speaker.
• Many‑to‑many: convert any source speaker to any target speaker without model constraints.

Data Types:

• Parallel corpora – recordings where source and target speakers utter the same sentences. Early VC research focused on this type because alignment is straightforward.
• Non‑parallel corpora – recordings with unrelated content, which are more realistic for practical use.

Parallel‑Data Methods:

1. Frame‑based conversion using direct mapping functions. When source and target utterances have different lengths, time alignment is required; the classic algorithm for this is Dynamic Time Warping (DTW).
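The alignment step above can be sketched with a minimal DTW implementation. This is a toy illustration, not a production aligner: each "frame" is a single scalar feature for readability, whereas a real VC system aligns multi‑dimensional spectral frames.

```python
def dtw(source, target):
    """Return the DTW cost and frame alignment path between two sequences."""
    n, m = len(source), len(target)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(source[i - 1] - target[j - 1])       # local frame distance
            cost[i][j] = d + min(cost[i - 1][j],          # skip a source frame
                                 cost[i][j - 1],          # skip a target frame
                                 cost[i - 1][j - 1])      # match frames
    # Backtrack to recover which source frame maps to which target frame.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
        if step == cost[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif step == cost[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return cost[n][m], path[::-1]

total, path = dtw([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 4.0])
```

Once the path is known, each source frame can be paired with its aligned target frame to train the frame‑level mapping function.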

2. Sequence‑based conversion that models relationships between frames (e.g., using attention mechanisms). This approach incorporates first‑order and second‑order dynamic (delta) features to improve prediction accuracy.
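The first‑ and second‑order dynamics mentioned above are commonly computed as delta features over the static frame values. A minimal sketch using central differences (one common choice of delta window; real systems often use a wider regression window):

```python
def deltas(frames):
    """First-order differences of a frame sequence (central differences,
    with edge frames clamped)."""
    n = len(frames)
    return [(frames[min(i + 1, n - 1)] - frames[max(i - 1, 0)]) / 2.0
            for i in range(n)]

static = [1.0, 3.0, 6.0, 6.0]   # toy static features, one scalar per frame
d1 = deltas(static)              # first-order (delta) dynamics
d2 = deltas(d1)                  # second-order (delta-delta) dynamics
```

Stacking `static`, `d1`, and `d2` per frame gives the conversion model a view of how each frame is changing, not just its instantaneous value.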

3. Neural‑network approaches with attention: a mel‑spectrogram and bottleneck features are encoded, then a phoneme discriminator aligns frames via attention, yielding high‑quality converted speech without DTW.
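The attention alignment in this pipeline replaces DTW with learned soft alignments: each output‑side query frame distributes softmax weights over the encoded source frames. A toy sketch with scalar frames and negative distance as the similarity score (real systems score learned encodings of mel‑spectrogram and bottleneck features):

```python
import math

def attend(query, keys):
    """Softmax attention weights of one query frame over source frames."""
    scores = [-abs(query - k) for k in keys]   # toy similarity: negative distance
    m = max(scores)                            # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

weights = attend(2.0, [0.0, 2.0, 5.0])
# the most similar source frame (index 1) receives the largest weight
```

Because the weights are differentiable, the alignment is learned jointly with the conversion model instead of being fixed in advance as with DTW.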

Non‑Parallel‑Data Methods:

• PPG‑based pipeline : First train a speaker‑independent ASR to obtain phoneme posteriorgrams (PPG). Then a deep LSTM converts PPGs to mel‑spectrograms, which are synthesized by a vocoder (e.g., STRAIGHT).
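The PPG itself is simply a per‑frame distribution over phoneme classes, obtained by softmaxing the ASR acoustic model's outputs. A toy sketch with made‑up logits and a 3‑phoneme inventory (real inventories have dozens to hundreds of classes):

```python
import math

def posteriorgram(frame_logits):
    """Turn per-frame phoneme logits (from an ASR acoustic model) into
    phoneme posterior probabilities -- one softmax per frame."""
    ppg = []
    for logits in frame_logits:
        m = max(logits)                          # stabilize the softmax
        exps = [math.exp(x - m) for x in logits]
        s = sum(exps)
        ppg.append([e / s for e in exps])
    return ppg

# Toy logits for 2 frames over a 3-phoneme inventory.
ppg = posteriorgram([[2.0, 0.5, 0.1], [0.1, 3.0, 0.2]])
```

Because the ASR is trained speaker‑independently, the PPG ideally keeps only "what was said", which is why a downstream LSTM can re‑render it in any target timbre.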

• Variational Auto‑Encoder (VAE) : The encoder compresses input speech into a latent variable that ideally contains only linguistic content. By injecting a target speaker’s identity vector into the decoder, the system reconstructs speech with the target timbre while preserving prosody.
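The identity injection can be pictured as concatenating the content latent with a speaker vector before decoding, so swapping the speaker vector swaps the timbre while the content half stays fixed. Everything here is a hypothetical stand‑in: the latent, the one‑hot speaker vectors, and the single toy linear layer playing the decoder.

```python
def decode(content_latent, speaker_vec, weights):
    """One toy linear 'decoder' layer over [content ; speaker]."""
    x = content_latent + speaker_vec  # list concatenation = feature concat
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

latent = [0.5, 0.25]             # "linguistic content" (fixed across speakers)
speaker_a = [1.0, 0.0]           # identity vector for speaker A
speaker_b = [0.0, 1.0]           # identity vector for speaker B
weights = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 1.0, 1.0, 0.0]]

out_a = decode(latent, speaker_a, weights)
out_b = decode(latent, speaker_b, weights)  # same content, different timbre
```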

• Normalizing Flow (Blow) : By applying a series of invertible transformations (F1, F2, …, FK) to map speech to a latent variable Z, one can replace the source speaker information with that of the target speaker and invert the flow to obtain converted speech.
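The key property Blow relies on is that every transform in the chain is exactly invertible, so speech can be mapped to the latent space and back without loss. A minimal sketch with scalar affine layers standing in for the learned invertible layers F1…FK:

```python
class AffineFlow:
    """One invertible affine layer: forward z = s*x + b, inverse x = (z-b)/s."""
    def __init__(self, scale, shift):
        assert scale != 0.0  # invertibility requires a nonzero scale
        self.scale, self.shift = scale, shift

    def forward(self, x):    # speech features -> latent direction
        return self.scale * x + self.shift

    def inverse(self, z):    # latent -> speech features direction
        return (z - self.shift) / self.scale

flows = [AffineFlow(2.0, 1.0), AffineFlow(0.5, -3.0)]  # toy 2-layer chain

def to_latent(x):
    for f in flows:          # apply F1, F2, ... in order
        x = f.forward(x)
    return x

def to_speech(z):
    for f in reversed(flows):  # apply inverses in reverse order
        z = f.inverse(z)
    return z

z = to_latent(4.0)
x = to_speech(z)  # round trip recovers the input exactly
```

In the actual model the speaker information is carried by conditioning inside the layers; converting a voice amounts to running `to_latent` with the source speaker's conditioning and `to_speech` with the target's.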

• GAN‑based approaches : CycleGAN and StarGAN have been applied to VC. CycleGAN uses a generator to map source X to target Y and another generator to reconstruct X from Y, enforcing cycle consistency. StarGAN extends this to many‑to‑many conversion by conditioning on speaker identity.
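The cycle‑consistency constraint can be made concrete with a toy round‑trip check: G maps source features toward the target domain, F maps back, and the loss penalizes any failure to reconstruct the input. These linear "generators" are hypothetical stand‑ins for trained networks, chosen here to be exact inverses so the loss is zero.

```python
def l1(a, b):
    """Mean absolute error between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def G(xs):  # source -> target domain (toy generator)
    return [2.0 * x for x in xs]

def F(ys):  # target -> source domain (toy generator)
    return [0.5 * y for y in ys]

def cycle_loss(x_batch, y_batch):
    # || F(G(x)) - x ||_1 + || G(F(y)) - y ||_1
    return l1(F(G(x_batch)), x_batch) + l1(G(F(y_batch)), y_batch)

loss = cycle_loss([1.0, 2.0, 3.0], [4.0, 6.0, 8.0])
```

In training, this term is added to the adversarial losses; it is what lets CycleGAN learn from non‑parallel data, since no aligned source/target pairs are ever needed.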

Future Directions at iQIYI: The team plans to rely more on non‑parallel data to reduce user constraints, improve prosody preservation (especially for singing or recitation), and apply data‑augmentation techniques to mitigate noise interference.

Finally, VC differs from Text‑to‑Speech (TTS): VC maps speech to speech, whereas TTS maps text to speech. Nevertheless, both fields share techniques such as reference audio conditioning and expressive synthesis.

Tags: Artificial Intelligence, deep learning, GAN, VAE, Audio Processing, Speech Synthesis, voice conversion