How to Achieve High‑Quality TTS with Only Minutes of Data
This article reviews neural speech synthesis, explains why large volumes of high‑quality paired data are essential, and surveys low‑resource solutions (semi‑supervised pre‑training, cross‑language transfer, speaker embeddings, and Conformer‑based model upgrades), showing how the Zuoyebang team built a robust TTS system from as little as seven minutes of recordings per speaker.
Background Introduction
Speech synthesis converts text into audible audio. Traditional methods fall into waveform concatenation and statistical parametric approaches. With deep learning, end‑to‑end neural TTS has become the research hotspot because it simplifies pipelines and yields more natural, expressive speech.
Neural TTS, however, requires large amounts of high‑quality paired text‑audio data; building a good model often needs 10 hours or more of recordings. In many scenarios, collecting such data is impractical.
Small‑Data Speech Synthesis Techniques
We categorize data conditions into two cases, <text, audio> mismatched (unpaired) and <text, audio> matched (paired), and discuss solutions for each.
<text, audio> Mismatched
Two main approaches are used for unpaired data:
Semi‑supervised pre‑training: pre‑train the text encoder with large text corpora (e.g., BERT) and the spectrogram decoder with vector‑quantized representations, then fine‑tune on a small matched set.
Dual learning with ASR and TTS: use an ASR model to generate transcripts for unlabelled audio and a TTS model to synthesize audio for unpaired text, iteratively improving both.
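The dual-learning loop described above can be illustrated as a pseudo-labeling round in which each model labels the other model's unpaired pool. This is a toy sketch under stated assumptions: simple string transforms stand in for the real neural ASR and TTS models, and the data are placeholder strings.

```python
# Toy sketch of the ASR <-> TTS dual-learning loop. The asr()/tts()
# functions below are hypothetical stand-ins, not real models.

def asr(audio):
    # "Recognize" audio back to text (stand-in for a neural ASR model).
    return audio.replace("AUDIO:", "")

def tts(text):
    # "Synthesize" audio from text (stand-in for a neural TTS model).
    return "AUDIO:" + text

def dual_learning_round(paired, unpaired_text, unpaired_audio):
    """One round: pseudo-label both unpaired pools, grow the paired set."""
    new_pairs = [(asr(a), a) for a in unpaired_audio]  # ASR adds transcripts
    new_pairs += [(t, tts(t)) for t in unpaired_text]  # TTS adds audio
    return paired + new_pairs

paired = [("hello", "AUDIO:hello")]
paired = dual_learning_round(paired, ["world"], ["AUDIO:again"])
print(len(paired))  # 3 pairs after one round
```

In a real system both models would then be retrained on the enlarged paired set, and the round repeats until quality plateaus.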
<text, audio> Matched
When paired data are limited, two strategies are common:
Cross‑language pre‑training: leverage abundant data from other languages, using shared phoneme sets (e.g., IPA) or byte‑level representations, then fine‑tune on the low‑resource language.
Multi‑speaker transfer: use high‑resource speakers to boost low‑resource speaker quality via voice conversion or speaker‑aware training.
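The core of cross-language pre-training above is a shared phoneme inventory: words from every language index into one embedding table, so acoustic knowledge learned on a high-resource language transfers to the low-resource one. A minimal sketch, where the IPA symbols and the tiny per-language lexicons are illustrative assumptions, not a complete front end:

```python
# Shared phoneme inventory for cross-language transfer (illustrative).
SHARED_PHONEMES = ["<pad>", "a", "i", "n", "t", "ʃ"]
PHONE_TO_ID = {p: i for i, p in enumerate(SHARED_PHONEMES)}

# Hypothetical mini-lexicons mapping words to shared IPA-style phonemes.
LEXICON = {
    "en": {"tea": ["t", "i"]},
    "es": {"ni": ["n", "i"]},
}

def to_ids(lang, word):
    """Map a word in either language to IDs in the shared phoneme space."""
    return [PHONE_TO_ID[p] for p in LEXICON[lang][word]]

print(to_ids("en", "tea"), to_ids("es", "ni"))
```

Here both languages resolve "i" to the same ID, so the low-resource language reuses the embedding and acoustic mapping trained on the high-resource one.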
Zuoyebang’s Solution
We adopt FastSpeech 2 + Multi-band MelGAN as the backbone and introduce a multi‑speaker strategy for limited data.
FastSpeech 2 uses a non‑autoregressive architecture and incorporates pitch and energy features to improve acoustic modeling. Multi-band MelGAN serves as the vocoder; its multi‑band, multi‑scale generation offers high audio fidelity and real‑time inference on CPUs.
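FastSpeech 2's pitch conditioning (energy works analogously) can be sketched as follows: frame-level pitch is quantized into bins, and the corresponding bin embedding is added to the encoder hidden states. The bin count, pitch range, and hidden size below are assumptions for illustration, not the paper's exact values.

```python
import numpy as np

# Sketch of FastSpeech 2's variance-adaptor idea for pitch (illustrative
# sizes: 256 bins, hidden dimension 8).
rng = np.random.default_rng(0)
N_BINS, HIDDEN = 256, 8
pitch_embedding = rng.normal(size=(N_BINS, HIDDEN))

def add_pitch(hidden, pitch_hz, f_min=80.0, f_max=400.0):
    """hidden: (T, HIDDEN) encoder states; pitch_hz: (T,) pitch per frame."""
    # Quantize pitch into N_BINS linear bins, clipped to the valid range.
    bins = ((pitch_hz - f_min) / (f_max - f_min) * N_BINS).astype(int)
    bins = np.clip(bins, 0, N_BINS - 1)
    return hidden + pitch_embedding[bins]

h = np.zeros((4, HIDDEN))
out = add_pitch(h, np.array([100.0, 150.0, 220.0, 330.0]))
print(out.shape)  # (4, 8)
```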
Speaker Embedding and Model Optimizations
We add a speaker‑embedding layer to the acoustic model so speaker representations are optimized automatically during training. For speakers with less than one hour of data this alone remains limited, so we adopt the winning approach from the M2VoC challenge: replace the learned per‑speaker embeddings with a single d‑vector extracted by ECAPA‑TDNN, which yields better results.
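The two conditioning paths can be contrasted in a small sketch: a trainable per-speaker embedding table versus a fixed d-vector extracted by a speaker-verification model such as ECAPA-TDNN. The dimensions and the linear projection below are assumptions for illustration.

```python
import numpy as np

# Learned speaker table vs. fixed d-vector conditioning (illustrative).
rng = np.random.default_rng(1)
N_SPEAKERS, SPK_DIM, HIDDEN = 10, 192, 8
speaker_table = rng.normal(size=(N_SPEAKERS, SPK_DIM))  # trained end-to-end
proj = rng.normal(size=(SPK_DIM, HIDDEN))

def condition(hidden, spk_vec):
    """Broadcast a projected speaker vector onto every frame."""
    return hidden + spk_vec @ proj

h = np.zeros((5, HIDDEN))
out_table = condition(h, speaker_table[3])  # path 1: trainable lookup
d_vector = rng.normal(size=SPK_DIM)         # path 2: extracted, then frozen
out_dvec = condition(h, d_vector)
```

The d-vector path is attractive for low-resource speakers because the speaker representation comes from a model trained on large verification corpora rather than from the few minutes of TTS data.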
Conformer Exploration
We replace the original Transformer encoder in FastSpeech 2 with a Conformer, which captures both local and global acoustic patterns more effectively, and we convert all LayerNorm layers to conditional LayerNorm so that speaker information is injected through the normalization parameters, which also reduces memory usage during deployment.
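The conditional LayerNorm idea can be sketched as follows: normalize as usual, but predict the scale and bias from the speaker embedding instead of using learned constants, so speaker identity flows through every normalization layer. The shapes and the linear predictors below are illustrative assumptions.

```python
import numpy as np

# Conditional LayerNorm sketch: speaker-dependent gamma/beta (illustrative).
rng = np.random.default_rng(2)
SPK_DIM, HIDDEN = 16, 8
W_scale = rng.normal(size=(SPK_DIM, HIDDEN)) * 0.01
W_bias = rng.normal(size=(SPK_DIM, HIDDEN)) * 0.01

def cond_layer_norm(x, spk, eps=1e-5):
    """x: (T, HIDDEN) hidden states; spk: (SPK_DIM,) speaker embedding."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)     # standard normalization
    gamma = 1.0 + spk @ W_scale               # scale predicted from speaker
    beta = spk @ W_bias                       # bias predicted from speaker
    return x_hat * gamma + beta

out = cond_layer_norm(rng.normal(size=(4, HIDDEN)), rng.normal(size=SPK_DIM))
```

Because only the small speaker vector varies per speaker, adapting to a new voice does not require storing a full copy of the model, which is where the deployment memory saving comes from.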
Business Benefits
Our optimized system faithfully reproduces a speaker's voice from only 30 minutes of recordings and mitigates issues such as abnormal pauses and inconsistent volume. Experiments on a 751‑speaker, 600‑hour dataset show that our approach outperforms baseline models in MOS for both 30‑minute and 7‑minute speakers.
Outlook
While progress has been made for low‑resource speakers, extreme‑low‑resource scenarios (a few seconds of audio) still pose challenges. Future work includes expanding high‑quality data, improving data selection, devising more efficient modeling strategies, and exploring vocoder enhancements.
Zuoyebang Tech Team
Sharing technical practices from Zuoyebang