iQIYI M2VoC Multi‑Speaker Multi‑Style Voice Cloning Challenge (ICASSP 2021) Overview
The iQIYI M2VoC Challenge at ICASSP 2021 invites researchers to tackle low‑resource multi‑speaker, multi‑style voice cloning by providing Mandarin datasets, few‑shot and extremely few‑shot tracks with strict data rules, MOS‑based subjective evaluation, and a $9,600 prize pool for top submissions.
Text‑to‑speech (TTS) technology converts written text into natural‑sounding speech and has become a core component of intelligent voice assistants, audio books, and information broadcasting. Recent advances in deep learning, end‑to‑end synthesis frameworks, and neural vocoders have dramatically improved the naturalness of generated speech.
However, most high‑quality TTS systems rely on large, single‑speaker corpora. When only a few utterances are available for a target speaker—especially in multi‑speaker, multi‑style scenarios—the quality, expressiveness, and robustness of the synthesized voice degrade. This challenge defines the problem as Multi‑Speaker Multi‑Style Voice Cloning (M2VoC), a low‑resource voice cloning task.
The M2VoC Challenge, organized jointly by iQIYI and several research institutions for ICASSP 2021, provides a common dataset and a fair evaluation platform to stimulate research on low‑resource voice cloning.
Tracks and Sub‑tracks
Track 1 – Few‑Shot Track: Participants receive data for two speakers in the validation stage and four speakers in the final stage, each with 100 utterances. Sub‑track 1A restricts system training to the provided data only; Sub‑track 1B allows the use of any publicly available data, provided the sources are disclosed.
Track 2 – Extremely Few‑Shot Track: Similar setup but each speaker provides only 5 utterances. The same sub‑track rules (1A, 1B) apply.
Evaluation and Ranking
Submissions are judged subjectively by listening tests using the following MOS‑based criteria (5‑point scale):
Speaker similarity
Speech quality
Style / expressiveness
Pronunciation accuracy
The weighted sum of these scores determines the final ranking for each sub‑task.
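The ranking rule above can be sketched in a few lines. Note that the official per-criterion weights were not published in this overview, so the equal weights below (and the score values) are purely illustrative assumptions:

```python
CRITERIA = ["similarity", "quality", "style", "pronunciation"]

def weighted_mos(scores, weights=None):
    """Combine per-criterion MOS scores (5-point scale) into one ranking score.

    `weights` is hypothetical: the challenge did not disclose the actual
    weighting, so equal weights are assumed by default.
    """
    if weights is None:
        weights = {c: 1.0 / len(CRITERIA) for c in CRITERIA}
    return sum(scores[c] * weights[c] for c in CRITERIA)

# Two hypothetical submissions with made-up MOS results
system_a = {"similarity": 4.2, "quality": 4.0, "style": 3.8, "pronunciation": 4.5}
system_b = {"similarity": 3.9, "quality": 4.3, "style": 4.1, "pronunciation": 4.0}

# Higher weighted MOS ranks first
ranking = sorted({"A": system_a, "B": system_b}.items(),
                 key=lambda kv: weighted_mos(kv[1]), reverse=True)
```

With equal weights this reduces to the mean of the four MOS scores; a real deployment would substitute the organizers' actual weights.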
Datasets
Four datasets are released at different stages:
MST (Multi‑Speaker Training): Combines AIShell‑3 (≈85 h, 218 speakers) and MST‑Originbeat (one male, one female speaker) recorded in high‑fidelity studios.
TSV (Target Speaker Validation): For each track, two validation speakers (100 samples for Track 1, 5 samples for Track 2) with diverse speaking styles.
TST (Target Speaker Test): Four test speakers per track (100 samples for Track 1, 5 samples for Track 2) used for final evaluation.
TT (Test Text): A list of sentences and paragraphs that participants must synthesize for the test speakers.
All audio is mono, 44.1 kHz, 16‑bit PCM, with corresponding transcripts, in Mandarin Chinese.
Timeline (AoE)
2020‑11‑27: Detailed participation guide released.
2020‑12‑04: Registration deadline; MST‑Originbeat and TSV released.
2021‑01‑08: TST released.
2021‑01‑13: TT released.
2021‑01‑15: Final synthesis submission deadline.
2021‑01‑29: Evaluation results announced.
2021‑02‑05: System description paper deadline.
2021‑02‑11: ICASSP paper submission deadline.
Prize Money
Total prize pool: USD 9,600 (provided by iQIYI). The top two teams in each sub‑track receive:
1st place: USD 1,500
2nd place: USD 800
Registration
Researchers from academia and industry can register at http://challenge.ai.iqiyi.com/M2Voc before 2020‑12‑04 (AoE). Teams must follow the competition rules posted on the website.
Organizers & Contact
Committee members include professors from Northwestern Polytechnical University, National University of Singapore, Tsinghua University, and senior managers from iQIYI, Origin AI, and Shellbeike. For questions, email [email protected].
iQIYI Technical Product Team