How Huolala’s In‑House TTS Overcomes Latency, Naturalness, and Multilingual Limits
This article details Huolala’s self‑developed Text‑to‑Speech system, outlining its architecture, the challenges of latency, naturalness, and language support, and the innovative solutions—including streaming synthesis, emotion modeling, and transfer‑learning‑based multilingual capabilities—that deliver more flexible and realistic voice interactions.
Background
TTS (Text‑to‑Speech) converts text into spoken output and is used by Huolala primarily in intelligent customer service and phone notification scenarios. Real‑time voice feedback improves user experience, while offline synthesis enables diverse, dynamic scripts, surpassing traditional pre‑recorded approaches.
Problems
High latency makes real‑time communication difficult.
Insufficient naturalness and emotional expression results in mechanical‑sounding speech.
Limited multilingual support hampers seamless language switching.
Solutions
Latency: Develop streaming TTS for real‑time voice output.
Naturalness: Train with mixed voice tones to improve realism.
Emotion: Introduce emotion modeling for more authentic speech.
Cross‑language: Use transfer learning to share acoustic features, ensuring smooth multilingual synthesis with consistent quality.
System Framework
The TTS system is divided into four layers: infrastructure (data storage, stability), platform (speech algorithm execution), application (scenario adaptation, security), and business (AI outbound calls, intelligent customer service).
Algorithm Overview
Mainstream TTS solutions rely on stable deep‑learning models such as Baidu's PaddleSpeech, Google's Tacotron series, and Microsoft's FastSpeech series, delivering high‑quality, natural speech for content creation, education, and customer service.
Huolala’s self‑developed solution supports both streaming and non‑streaming synthesis, built on VITS2 with an optimized decoder for streaming. It also incorporates a Bert‑VITS2‑style text analysis model to enhance naturalness and expressiveness.
Text Encoder
The text encoder converts raw text into model‑readable features. Optimizations include (see the frontend sketch after this list):
Text normalization: Clean and standardize input.
Phoneme extraction: Accurately capture phonemes in a Chinese-language context.
Prosody optimization: Dynamically adjust intonation, pauses, and stress.
Semantic enhancement: Extract semantic features to boost expressive power.
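A minimal sketch of such a frontend, assuming the pypinyin package for Mandarin grapheme‑to‑phoneme conversion; the normalization and prosody rules below are illustrative placeholders, not Huolala's production logic:

```python
# Illustrative Mandarin TTS text frontend (assumes the pypinyin package;
# not Huolala's actual pipeline).
import re
from pypinyin import Style, lazy_pinyin

# Pause-level markers: #2 = phrase pause, #3 = sentence-final pause.
PAUSE = {"，": "#2", "、": "#2", "。": "#3", "？": "#3", "！": "#3"}

def normalize(text: str) -> str:
    """Clean and standardize raw input; production frontends also expand
    numbers, dates, and units into spoken words here."""
    return re.sub(r"\s+", "", text.strip())

def to_phonemes(text: str) -> list[str]:
    """Tone-annotated pinyin; non-Chinese characters pass through as-is."""
    return lazy_pinyin(text, style=Style.TONE3, neutral_tone_with_five=True)

def add_prosody(units: list[str]) -> list[str]:
    """Map punctuation to pause markers (a stand-in for a learned model
    that predicts intonation breaks and stress)."""
    return [PAUSE.get(u, u) for u in units]

print(add_prosody(to_phonemes(normalize("你好，欢迎使用货拉拉。"))))
# ['ni3', 'hao3', '#2', 'huan1', 'ying2', ..., '#3']
```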
Decoder
The decoder transforms features from the text encoder into audio signals. For streaming, the VITS2 decoder processes audio features in blocks, decoding each block in real time and applying overlap‑add to ensure smooth transitions.
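The sketch below illustrates that overlap‑add scheme in NumPy; `decode_block`, together with the block, overlap, and hop sizes, is an illustrative stand‑in rather than the actual VITS2 decoder interface:

```python
# Block-wise streaming synthesis with overlap-add (illustrative sketch).
import numpy as np

def stream_decode(latents, decode_block, block=40, overlap=8, hop=256):
    """Decode latent frames in overlapping blocks, crossfading the shared
    region so chunk boundaries stay click-free."""
    ov = overlap * hop                      # overlap length in samples
    fade_in = np.linspace(0.0, 1.0, ov)
    fade_out = 1.0 - fade_in
    step = block - overlap
    tail = None                             # last `ov` samples of prev block
    for start in range(0, max(len(latents) - overlap, 1), step):
        audio = decode_block(latents[start:start + block])
        if tail is not None:                # crossfade with previous tail
            audio[:ov] = audio[:ov] * fade_in + tail * fade_out
        if start + block < len(latents):    # more blocks coming: hold tail
            tail = audio[-ov:].copy()
            yield audio[:-ov]
        else:                               # final block: flush everything
            yield audio
            break

# Toy decoder: each 192-dim latent frame becomes `hop` audio samples.
def toy_decoder(z):
    return np.repeat(np.tanh(z[:, 0]), 256)

latents = np.random.default_rng(0).standard_normal((100, 192))
chunks = list(stream_decode(latents, toy_decoder))
assert sum(len(c) for c in chunks) == 100 * 256   # no samples lost
```

Because each block is emitted as soon as it is decoded, playback can begin after the first block rather than after the full utterance, which is what cuts perceived latency.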
Emotion Modeling
Emotion features are introduced using a CLAP‑based emotion classification model that extracts emotional embeddings from text, enhancing the naturalness and expressiveness of synthesized speech.
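A minimal sketch of extracting such a text‑side embedding, assuming the Hugging Face transformers CLAP implementation and the public laion/clap-htsat-unfused checkpoint (the article does not specify Huolala's exact setup):

```python
# Text-side emotion embedding from a CLAP checkpoint (tooling assumed;
# not Huolala's actual implementation).
import torch
from transformers import ClapModel, ClapProcessor

name = "laion/clap-htsat-unfused"
model = ClapModel.from_pretrained(name)
processor = ClapProcessor.from_pretrained(name)

texts = ["非常抱歉给您带来不便", "Great news, your order has arrived!"]
inputs = processor(text=texts, return_tensors="pt", padding=True)
with torch.no_grad():
    emotion_emb = model.get_text_features(**inputs)  # (2, 512) projections
# The embedding would be concatenated with (or added to) the text
# encoder's features so the synthesizer can condition prosody on emotion.
print(emotion_emb.shape)
```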
Voice Customization
Custom voice creation is achieved through (a fine‑tuning sketch follows the list):
Few‑shot training: Quickly fine‑tune specific voice timbres with limited data.
Efficient transfer learning: Lightweight adjustments adapt the model to new voice styles.
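As a rough illustration of the freeze‑and‑adapt recipe, the PyTorch sketch below fine‑tunes only a speaker embedding and a small decoder head; the toy module layout is a hypothetical stand‑in, not Huolala's real architecture:

```python
# Few-shot voice adaptation: freeze the backbone, train a small subset.
import torch
import torch.nn as nn

class TinyTTS(nn.Module):
    """Toy stand-in for a pretrained TTS model (illustrative only)."""
    def __init__(self, n_speakers=8, dim=64):
        super().__init__()
        self.encoder = nn.Linear(dim, dim)    # frozen "backbone"
        self.spk_emb = nn.Embedding(n_speakers, dim)
        self.decoder = nn.Linear(dim, dim)    # lightweight adapted head

    def forward(self, x, speaker):
        return self.decoder(self.encoder(x) + self.spk_emb(speaker))

model = TinyTTS()
for p in model.parameters():
    p.requires_grad = False                   # freeze everything first
for p in model.spk_emb.parameters():          # new timbre: speaker embedding
    p.requires_grad = True
for p in model.decoder.parameters():          # plus a small decoder head
    p.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
opt = torch.optim.AdamW(trainable, lr=1e-4)

# One illustrative adaptation step on a (features, target) pair.
x, y = torch.randn(4, 64), torch.randn(4, 64)
spk = torch.zeros(4, dtype=torch.long)
loss = nn.functional.l1_loss(model(x, spk), y)
loss.backward()
opt.step()
```

Because only a small fraction of the parameters update, a handful of recordings can be enough to capture a new timbre without destabilizing the pretrained model.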
Cross‑Language Transfer Learning
To support multilingual synthesis, transfer learning shares acoustic features across languages, reducing training cost and enabling seamless language switching. A multilingual pre‑trained BERT model ensures consistent quality.
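A minimal sketch of the shared‑feature idea, assuming Hugging Face transformers and the public bert-base-multilingual-cased checkpoint (the article does not name the exact model Huolala uses):

```python
# One multilingual text encoder produces features for every language,
# so a single acoustic model can be conditioned on them.
import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
bert = AutoModel.from_pretrained(name)

sentences = ["欢迎使用货拉拉", "Welcome to Huolala"]  # mixed-language input
inputs = tokenizer(sentences, return_tensors="pt", padding=True)
with torch.no_grad():
    hidden = bert(**inputs).last_hidden_state   # (2, seq_len, 768)
print(hidden.shape)
```

Because both languages land in the same feature space, switching languages mid‑dialogue needs no per‑language acoustic retraining.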
Results
Comparisons between Huolala’s self‑developed TTS and third‑party solutions show that the former delivers more natural, human‑like speech, especially in emotional expression and real‑time scenarios.
Conclusion and Outlook
The article presents Huolala’s TTS advancements—streaming synthesis, emotion expression, multilingual support, and voice customization—aimed at delivering flexible, real‑time, and natural voice interaction. Future work will continue to innovate TTS technology to enrich Huolala’s ecosystem with smarter audio applications.
References
[1] Wu, Yusong, et al. "Large‑scale contrastive language‑audio pretraining with feature fusion and keyword‑to‑caption augmentation." ICASSP 2023.
[2] Kong, Jungil, et al. "VITS2: Improving quality and efficiency of single‑stage text‑to‑speech with adversarial learning and architecture design." arXiv preprint arXiv:2307.16430 (2023).
[3] Ren, Yi, et al. "FastSpeech 2: Fast and high‑quality end‑to‑end text to speech." arXiv preprint arXiv:2006.04558 (2020).
[4] Tan, Xu, et al. "A survey on neural speech synthesis." arXiv preprint arXiv:2106.15561 (2021).
[5] Zhang, Hui, et al. "PaddleSpeech: An easy‑to‑use all‑in‑one speech toolkit." arXiv preprint arXiv:2205.12007 (2022).
[6] Devlin, Jacob, et al. "BERT: Pre‑training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).
[7] Mohamed, Abdelrahman, et al. "Self‑supervised speech representation learning: A review." IEEE Journal of Selected Topics in Signal Processing 16.6 (2022): 1179‑1210.