How to Achieve High‑Quality TTS with Only Minutes of Data
This article reviews neural speech synthesis, explains why large volumes of high‑quality paired data are essential, and surveys low‑resource solutions (semi‑supervised pre‑training, cross‑language transfer, speaker embeddings, and Conformer‑based model upgrades), showing how the Zuoyebang team built a robust TTS system from as little as seven minutes of recordings per speaker.
Background Introduction
Speech synthesis converts text into audible audio. Traditional methods fall into waveform concatenation and statistical parametric approaches. With deep learning, end‑to‑end neural TTS has become the research hotspot because it simplifies pipelines and yields more natural, expressive speech.
Neural TTS, however, requires large amounts of high‑quality paired text‑audio data; building a good model often needs 10 hours or more of recordings. In many scenarios, collecting such data is impractical.
Small‑Data Speech Synthesis Techniques
We categorize data conditions into two cases, <text, audio> mismatched (unpaired) and <text, audio> matched (paired), and discuss solutions for each.
<text, audio> Mismatched
Two main approaches are used for unpaired data:
Semi‑supervised pre‑training: pre‑train the text encoder with large text corpora (e.g., BERT) and the spectrogram decoder with vector‑quantized representations, then fine‑tune on a small matched set.
Dual learning with ASR and TTS: use an ASR model to generate transcripts for unlabelled audio and a TTS model to synthesize audio for unpaired text, iteratively improving both.
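The dual-learning loop described above can be illustrated as a pseudo-labeling round in which each model labels the other model's unpaired pool. This is a toy sketch under stated assumptions: simple string transforms stand in for the real neural ASR and TTS models, and the data are placeholder strings.

```python
# Toy sketch of the ASR <-> TTS dual-learning loop. The asr()/tts()
# functions below are hypothetical stand-ins, not real models.

def asr(audio):
    # "Recognize" audio back to text (stand-in for a neural ASR model).
    return audio.replace("AUDIO:", "")

def tts(text):
    # "Synthesize" audio from text (stand-in for a neural TTS model).
    return "AUDIO:" + text

def dual_learning_round(paired, unpaired_text, unpaired_audio):
    """One round: pseudo-label both unpaired pools, grow the paired set."""
    new_pairs = [(asr(a), a) for a in unpaired_audio]  # ASR adds transcripts
    new_pairs += [(t, tts(t)) for t in unpaired_text]  # TTS adds audio
    return paired + new_pairs

paired = [("hello", "AUDIO:hello")]
paired = dual_learning_round(paired, ["world"], ["AUDIO:again"])
print(len(paired))  # 3 pairs after one round
```

In a real system both models would then be retrained on the enlarged paired set, and the round repeats until quality plateaus.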
<text, audio> Matched
When paired data are limited, two strategies are common:
Cross‑language pre‑training: leverage abundant data from other languages, using shared phoneme sets (e.g., IPA) or byte‑level representations, then fine‑tune on the low‑resource language.
Multi‑speaker transfer: use high‑resource speakers to boost low‑resource speaker quality via voice conversion or speaker‑aware training.
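The core of cross-language pre-training above is a shared phoneme inventory: words from every language index into one embedding table, so acoustic knowledge learned on a high-resource language transfers to the low-resource one. A minimal sketch, where the IPA symbols and the tiny per-language lexicons are illustrative assumptions, not a complete front end:

```python
# Shared phoneme inventory for cross-language transfer (illustrative).
SHARED_PHONEMES = ["<pad>", "a", "i", "n", "t", "ʃ"]
PHONE_TO_ID = {p: i for i, p in enumerate(SHARED_PHONEMES)}

# Hypothetical mini-lexicons mapping words to shared IPA-style phonemes.
LEXICON = {
    "en": {"tea": ["t", "i"]},
    "es": {"ni": ["n", "i"]},
}

def to_ids(lang, word):
    """Map a word in either language to IDs in the shared phoneme space."""
    return [PHONE_TO_ID[p] for p in LEXICON[lang][word]]

print(to_ids("en", "tea"), to_ids("es", "ni"))
```

Here both languages resolve "i" to the same ID, so the low-resource language reuses the embedding and acoustic mapping trained on the high-resource one.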
Zuoyebang’s Solution
We adopt FastSpeech 2 + Multi-band MelGAN as the backbone and introduce a multi‑speaker strategy for limited data.
FastSpeech 2 uses a non‑autoregressive architecture and incorporates pitch and energy features to improve acoustic modeling. Multi-band MelGAN serves as the vocoder; its multi‑band, multi‑scale generation offers high audio fidelity and real‑time inference on CPUs.
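FastSpeech 2's pitch conditioning (energy works analogously) can be sketched as follows: frame-level pitch is quantized into bins, and the corresponding bin embedding is added to the encoder hidden states. The bin count, pitch range, and hidden size below are assumptions for illustration, not the paper's exact values.

```python
import numpy as np

# Sketch of FastSpeech 2's variance-adaptor idea for pitch (illustrative
# sizes: 256 bins, hidden dimension 8).
rng = np.random.default_rng(0)
N_BINS, HIDDEN = 256, 8
pitch_embedding = rng.normal(size=(N_BINS, HIDDEN))

def add_pitch(hidden, pitch_hz, f_min=80.0, f_max=400.0):
    """hidden: (T, HIDDEN) encoder states; pitch_hz: (T,) pitch per frame."""
    # Quantize pitch into N_BINS linear bins, clipped to the valid range.
    bins = ((pitch_hz - f_min) / (f_max - f_min) * N_BINS).astype(int)
    bins = np.clip(bins, 0, N_BINS - 1)
    return hidden + pitch_embedding[bins]

h = np.zeros((4, HIDDEN))
out = add_pitch(h, np.array([100.0, 150.0, 220.0, 330.0]))
print(out.shape)  # (4, 8)
```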
Speaker Embedding and Model Optimizations
We add a speaker‑embedding layer to the acoustic model so speaker representations are optimized automatically during training. For speakers with less than one hour of data this alone remains limited, so we adopt the winning approach from the M2VoC challenge: replace the learned per‑speaker embeddings with a single d‑vector extracted by ECAPA‑TDNN, which yields better results.
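The two conditioning paths can be contrasted in a small sketch: a trainable per-speaker embedding table versus a fixed d-vector extracted by a speaker-verification model such as ECAPA-TDNN. The dimensions and the linear projection below are assumptions for illustration.

```python
import numpy as np

# Learned speaker table vs. fixed d-vector conditioning (illustrative).
rng = np.random.default_rng(1)
N_SPEAKERS, SPK_DIM, HIDDEN = 10, 192, 8
speaker_table = rng.normal(size=(N_SPEAKERS, SPK_DIM))  # trained end-to-end
proj = rng.normal(size=(SPK_DIM, HIDDEN))

def condition(hidden, spk_vec):
    """Broadcast a projected speaker vector onto every frame."""
    return hidden + spk_vec @ proj

h = np.zeros((5, HIDDEN))
out_table = condition(h, speaker_table[3])  # path 1: trainable lookup
d_vector = rng.normal(size=SPK_DIM)         # path 2: extracted, then frozen
out_dvec = condition(h, d_vector)
```

The d-vector path is attractive for low-resource speakers because the speaker representation comes from a model trained on large verification corpora rather than from the few minutes of TTS data.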
Conformer Exploration
We replace the original Transformer encoder in FastSpeech 2 with a Conformer, which captures both local and global acoustic patterns more effectively, and we convert all LayerNorm layers to conditional LayerNorm so that speaker information is injected through the normalization parameters, which also reduces memory usage during deployment.
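The conditional LayerNorm idea can be sketched as follows: normalize as usual, but predict the scale and bias from the speaker embedding instead of using learned constants, so speaker identity flows through every normalization layer. The shapes and the linear predictors below are illustrative assumptions.

```python
import numpy as np

# Conditional LayerNorm sketch: speaker-dependent gamma/beta (illustrative).
rng = np.random.default_rng(2)
SPK_DIM, HIDDEN = 16, 8
W_scale = rng.normal(size=(SPK_DIM, HIDDEN)) * 0.01
W_bias = rng.normal(size=(SPK_DIM, HIDDEN)) * 0.01

def cond_layer_norm(x, spk, eps=1e-5):
    """x: (T, HIDDEN) hidden states; spk: (SPK_DIM,) speaker embedding."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)     # standard normalization
    gamma = 1.0 + spk @ W_scale               # scale predicted from speaker
    beta = spk @ W_bias                       # bias predicted from speaker
    return x_hat * gamma + beta

out = cond_layer_norm(rng.normal(size=(4, HIDDEN)), rng.normal(size=SPK_DIM))
```

Because only the small speaker vector varies per speaker, adapting to a new voice does not require storing a full copy of the model, which is where the deployment memory saving comes from.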
Business Benefits
Our optimized system faithfully reproduces a speaker's voice from only 30 minutes of recordings and mitigates issues such as abnormal pauses and inconsistent volume. Experiments on a 751‑speaker, 600‑hour dataset show that our approach outperforms baseline models in MOS for both 30‑minute and 7‑minute speakers.
Outlook
While progress has been made for low‑resource speakers, extreme‑low‑resource scenarios (a few seconds of audio) still pose challenges. Future work includes expanding high‑quality data, improving data selection, devising more efficient modeling strategies, and exploring vocoder enhancements.
Zuoyebang Tech Team
Sharing technical practices from Zuoyebang