
iQIYI M2VoC Multi‑Speaker Multi‑Style Voice Cloning Challenge at ICASSP 2021 – Overview and Results

The iQIYI M2VoC competition at ICASSP 2021, the first low‑resource multi‑speaker, multi‑style voice‑cloning challenge, attracted 153 academic and industry teams to its few‑shot (100 utterances) and extreme few‑shot (5 utterances) tracks. Submissions were evaluated by professional listeners and yielded strong innovations and practical applications, while confirming that single‑sample cloning remains unsolved.

iQIYI Technical Product Team

In recent years, advances in transfer learning, style transfer, vocoders, and acoustic models have opened up potential solutions for low‑resource voice cloning. iQIYI, together with the Audio Speech and Language Processing group of Northwestern Polytechnical University, National University of Singapore, Tsinghua University Shenzhen International Graduate School, Origin Intelligence, and Hill Shell, organized the Multi‑Speaker Multi‑Style Voice Cloning Competition (M2VoC) at ICASSP 2021.

The M2VoC challenge aims to provide a universal dataset and a fair testing platform for voice‑cloning research. It was one of the flagship tasks of the ICASSP 2021 Signal Processing Challenge, attracting many academic and industrial teams.

A total of 153 teams registered, including numerous academic institutions such as Peking University, Tsinghua University, Zhejiang University, Shanghai Jiao Tong University, National Taiwan University, Harbin Institute of Technology, the University of Crete, the Institute of Automation of the Chinese Academy of Sciences, the University of Tsukuba, Nagoya University, Fudan University, the Chinese University of Hong Kong, the University of the Chinese Academy of Sciences, and the University of Electronic Science and Technology of China, as well as technology companies such as Huya, Microsoft, Didi, Tencent, and NetEase.

The competition featured two tracks: a few‑shot track (100 utterances per speaker with diverse speaking styles) and an extreme few‑shot track (5 utterances per speaker). Two base corpora, each containing 5,000 utterances of varied speaking styles, were provided for training baseline models. Submissions were evaluated using a weighted combination of four criteria: speaker similarity, speech quality, style/expressiveness, and pronunciation accuracy.
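The weighted scoring described above can be sketched in a few lines. Note that the article does not state the actual criterion weights used by M2VoC, so the equal weights below are illustrative placeholders only:

```python
# Hypothetical sketch of the four-criterion weighted scoring described above.
# The real M2VoC weights are not given in this article; equal weights are
# used here purely for illustration.

CRITERIA_WEIGHTS = {
    "speaker_similarity": 0.25,
    "speech_quality": 0.25,
    "style_expressiveness": 0.25,
    "pronunciation_accuracy": 0.25,
}

def weighted_score(ratings: dict) -> float:
    """Combine per-criterion listener ratings (e.g. on a 1-5 scale)
    into a single weighted score."""
    return sum(CRITERIA_WEIGHTS[k] * ratings[k] for k in CRITERIA_WEIGHTS)

# Example: one system's averaged ratings across the four criteria.
example = {
    "speaker_similarity": 4.2,
    "speech_quality": 3.8,
    "style_expressiveness": 4.0,
    "pronunciation_accuracy": 4.5,
}
print(weighted_score(example))
```

With equal weights this reduces to a simple average; a challenge organizer could instead emphasize, say, speaker similarity by raising its weight.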

Evaluation was conducted in two rounds of subjective listening tests. The first round included all teams, while the second round focused on the top‑scoring teams. A sampling evaluation method was adopted to reduce the high cost of subjective assessment. Sixty‑six professional listeners participated in the first round and thirty in the second round; all listeners were native Chinese speakers, comprising linguistics students and professional annotators.

The challenge resulted in 18 submitted papers, of which six were accepted for publication in the ICASSP 2021 proceedings. The MOS (Mean Opinion Score) results for both tracks are illustrated in the original figures.
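For readers unfamiliar with the metric, a MOS is simply the arithmetic mean of subjective listener ratings, typically on a 1 (bad) to 5 (excellent) scale. A minimal sketch, with made-up ratings for illustration:

```python
# Minimal MOS (Mean Opinion Score) sketch: the mean of listener ratings
# on a 1-5 scale. The ratings below are illustrative, not challenge data.
import statistics

def mos(ratings: list) -> float:
    """Return the Mean Opinion Score for a list of listener ratings."""
    return statistics.mean(ratings)

listener_ratings = [4, 5, 3, 4, 4, 5, 4, 3]
print(mos(listener_ratings))
```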

Participating teams introduced innovations across acoustic modeling, speaker representation, vocoder design, and speaker‑adaptation strategies, achieving strong performance. The outcomes have been applied to app narration, user‑generated content dubbing, audiobooks, and style‑controlled speech synthesis, especially for low‑quality, multi‑style audio scenarios.

In summary, the iQIYI M2VoC competition is the world’s first low‑resource voice‑cloning challenge, offering a common dataset and evaluation platform. It demonstrates that few‑shot voice cloning has made significant progress, yet single‑sample voice cloning remains an unsolved problem. Real‑world applications must also contend with noisy audio and constraints on training, adaptation, and inference time and cost.

iQIYI released related papers at ICASSP 2021, hoping that the competition’s results will spur further innovation in voice cloning, speech recognition, and broader artificial‑intelligence technologies, thereby expanding opportunities for the audiovisual industry.

AI, Audio Processing, few-shot learning, Speech Synthesis, ICASSP 2021, M2VoC, voice cloning