iQIYI M2VoC Multi‑Speaker Multi‑Style Voice Cloning Challenge (ICASSP 2021) Overview
The iQIYI M2VoC Challenge at ICASSP 2021 invites researchers to tackle low‑resource multi‑speaker, multi‑style voice cloning by providing Mandarin datasets, few‑shot and extremely few‑shot tracks with strict data rules, MOS‑based subjective evaluation, and a $9,600 prize pool for top submissions.
Text‑to‑speech (TTS) technology converts written text into natural‑sounding speech and has become a core component of intelligent voice assistants, audio books, and information broadcasting. Recent advances in deep learning, end‑to‑end synthesis frameworks, and neural vocoders have dramatically improved the naturalness of generated speech.
However, most high‑quality TTS systems rely on large, single‑speaker corpora. When only a few utterances are available for a target speaker—especially in multi‑speaker, multi‑style scenarios—the quality, expressiveness, and robustness of the synthesized voice degrade. This challenge defines the problem as Multi‑Speaker Multi‑Style Voice Cloning (M2VoC), a low‑resource voice cloning task.
The M2VoC Challenge, organized jointly by iQIYI and several research institutions for ICASSP 2021, provides a common dataset and a fair evaluation platform to stimulate research on low‑resource voice cloning.
Tracks and Sub‑tracks
Track 1 – Few‑Shot Track: Participants receive data for two speakers in the validation stage and four speakers in the final stage, each with 100 utterances. Sub‑track 1A restricts system training to the provided data only; Sub‑track 1B allows the use of any publicly available data, provided the sources are disclosed.
Track 2 – Extremely Few‑Shot Track: Similar setup but each speaker provides only 5 utterances. The same sub‑track rules (1A, 1B) apply.
Evaluation and Ranking
Submissions are judged subjectively by listening tests using the following MOS‑based criteria (5‑point scale):
Speaker similarity
Speech quality
Style / expressiveness
Pronunciation accuracy
The weighted sum of these scores determines the final ranking for each sub‑task.
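The ranking rule above can be sketched in a few lines. Note that the official per-criterion weights were not published in this overview, so the equal weights below (and the score values) are purely illustrative assumptions:

```python
CRITERIA = ["similarity", "quality", "style", "pronunciation"]

def weighted_mos(scores, weights=None):
    """Combine per-criterion MOS scores (5-point scale) into one ranking score.

    `weights` is hypothetical: the challenge did not disclose the actual
    weighting, so equal weights are assumed by default.
    """
    if weights is None:
        weights = {c: 1.0 / len(CRITERIA) for c in CRITERIA}
    return sum(scores[c] * weights[c] for c in CRITERIA)

# Two hypothetical submissions with made-up MOS results
system_a = {"similarity": 4.2, "quality": 4.0, "style": 3.8, "pronunciation": 4.5}
system_b = {"similarity": 3.9, "quality": 4.3, "style": 4.1, "pronunciation": 4.0}

# Higher weighted MOS ranks first
ranking = sorted({"A": system_a, "B": system_b}.items(),
                 key=lambda kv: weighted_mos(kv[1]), reverse=True)
```

With equal weights this reduces to the mean of the four MOS scores; a real deployment would substitute the organizers' actual weights.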
Datasets
Four datasets are released at different stages:
MST (Multi‑Speaker Training): Combines AIShell‑3 (≈85 h, 218 speakers) and MST‑Originbeat (one male, one female speaker) recorded in high‑fidelity studios.
TSV (Target Speaker Validation): For each track, two validation speakers (100 samples for Track 1, 5 samples for Track 2) with diverse speaking styles.
TST (Target Speaker Test): Four test speakers per track (100 samples for Track 1, 5 samples for Track 2) used for final evaluation.
TT (Test Text): A list of sentences and paragraphs that participants must synthesize for the test speakers.
All audio is mono, 44.1 kHz, 16‑bit PCM, with corresponding transcripts, in Mandarin Chinese.
Timeline (AoE)
2020‑11‑27: Detailed participation guide released.
2020‑12‑04: Registration deadline; MST‑Originbeat and TSV released.
2021‑01‑08: TST released.
2021‑01‑13: TT released.
2021‑01‑15: Final synthesis submission deadline.
2021‑01‑29: Evaluation results announced.
2021‑02‑05: System description paper deadline.
2021‑02‑11: ICASSP paper submission deadline.
Prize Money
Total prize pool: USD 9,600 (provided by iQIYI). The top two teams in each sub‑track receive:
1st place: USD 1,500
2nd place: USD 800
Registration
Researchers from academia and industry can register at http://challenge.ai.iqiyi.com/M2Voc before 2020‑12‑04 (AoE). Teams must follow the competition rules posted on the website.
Organizers & Contact
Committee members include professors from Northwestern Polytechnical University, National University of Singapore, Tsinghua University, and senior managers from iQIYI, Origin AI, and Shellbeike. For questions, email [email protected].
iQIYI Technical Product Team