How Huolala’s In‑House TTS Overcomes Latency, Naturalness, and Multilingual Limits
This article details Huolala’s self‑developed Text‑to‑Speech system, outlining its architecture, the challenges of latency, naturalness, and language support, and the innovative solutions—including streaming synthesis, emotion modeling, and transfer‑learning‑based multilingual capabilities—that deliver more flexible and realistic voice interactions.
Background
TTS (Text‑to‑Speech) converts text into spoken output and is used by Huolala primarily in intelligent customer service and phone notification scenarios. Real‑time voice feedback improves user experience, while offline synthesis enables diverse, dynamic scripts, surpassing traditional pre‑recorded approaches.
Problems
High latency makes real‑time communication difficult.
Insufficient naturalness and emotional expression results in mechanical‑sounding speech.
Limited multilingual support hampers seamless language switching.
Solutions
Latency: Develop streaming TTS for real‑time voice output.
Naturalness: Train with mixed voice tones to improve realism.
Emotion: Introduce emotion modeling for more authentic speech.
Cross‑language: Use transfer learning to share acoustic features, ensuring smooth multilingual synthesis with consistent quality.
System Framework
The TTS system is divided into four layers: infrastructure (data storage, stability), platform (speech algorithm execution), application (scenario adaptation, security), and business (AI outbound calls, intelligent customer service).
Algorithm Overview
Mainstream TTS solutions rely on stable deep‑learning models such as Baidu's PaddleSpeech, Google's Tacotron series, and Microsoft's FastSpeech series, delivering high‑quality, natural speech for content creation, education, and customer service.
Huolala’s self‑developed solution supports both streaming and non‑streaming synthesis, built on VITS2 with an optimized decoder for streaming. It also incorporates a Bert‑VITS2‑style text analysis model to enhance naturalness and expressiveness.
Text Encoder
The text encoder converts raw text into model‑readable features. Optimizations include (see the frontend sketch after this list):
Text normalization: Clean and standardize input.
Phoneme extraction: Accurately capture phonemes in a Chinese-language context.
Prosody optimization: Dynamically adjust intonation, pauses, and stress.
Semantic enhancement: Extract semantic features to boost expressive power.
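A minimal sketch of such a frontend, assuming the pypinyin package for Mandarin grapheme‑to‑phoneme conversion; the normalization and prosody rules below are illustrative placeholders, not Huolala's production logic:

```python
# Illustrative Mandarin TTS text frontend (assumes the pypinyin package;
# not Huolala's actual pipeline).
import re
from pypinyin import Style, lazy_pinyin

# Pause-level markers: #2 = phrase pause, #3 = sentence-final pause.
PAUSE = {"，": "#2", "、": "#2", "。": "#3", "？": "#3", "！": "#3"}

def normalize(text: str) -> str:
    """Clean and standardize raw input; production frontends also expand
    numbers, dates, and units into spoken words here."""
    return re.sub(r"\s+", "", text.strip())

def to_phonemes(text: str) -> list[str]:
    """Tone-annotated pinyin; non-Chinese characters pass through as-is."""
    return lazy_pinyin(text, style=Style.TONE3, neutral_tone_with_five=True)

def add_prosody(units: list[str]) -> list[str]:
    """Map punctuation to pause markers (a stand-in for a learned model
    that predicts intonation breaks and stress)."""
    return [PAUSE.get(u, u) for u in units]

print(add_prosody(to_phonemes(normalize("你好，欢迎使用货拉拉。"))))
# ['ni3', 'hao3', '#2', 'huan1', 'ying2', ..., '#3']
```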
Decoder
The decoder transforms features from the text encoder into audio signals. For streaming, the VITS2 decoder processes audio features in blocks, decoding each block in real time and applying overlap‑add to ensure smooth transitions.
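The sketch below illustrates that overlap‑add scheme in NumPy; `decode_block`, together with the block, overlap, and hop sizes, is an illustrative stand‑in rather than the actual VITS2 decoder interface:

```python
# Block-wise streaming synthesis with overlap-add (illustrative sketch).
import numpy as np

def stream_decode(latents, decode_block, block=40, overlap=8, hop=256):
    """Decode latent frames in overlapping blocks, crossfading the shared
    region so chunk boundaries stay click-free."""
    ov = overlap * hop                      # overlap length in samples
    fade_in = np.linspace(0.0, 1.0, ov)
    fade_out = 1.0 - fade_in
    step = block - overlap
    tail = None                             # last `ov` samples of prev block
    for start in range(0, max(len(latents) - overlap, 1), step):
        audio = decode_block(latents[start:start + block])
        if tail is not None:                # crossfade with previous tail
            audio[:ov] = audio[:ov] * fade_in + tail * fade_out
        if start + block < len(latents):    # more blocks coming: hold tail
            tail = audio[-ov:].copy()
            yield audio[:-ov]
        else:                               # final block: flush everything
            yield audio
            break

# Toy decoder: each 192-dim latent frame becomes `hop` audio samples.
def toy_decoder(z):
    return np.repeat(np.tanh(z[:, 0]), 256)

latents = np.random.default_rng(0).standard_normal((100, 192))
chunks = list(stream_decode(latents, toy_decoder))
assert sum(len(c) for c in chunks) == 100 * 256   # no samples lost
```

Because each block is emitted as soon as it is decoded, playback can begin after the first block rather than after the full utterance, which is what cuts perceived latency.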
Emotion Modeling
Emotion features are introduced using a CLAP‑based emotion classification model that extracts emotional embeddings from text, enhancing the naturalness and expressiveness of synthesized speech.
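A minimal sketch of extracting such a text‑side embedding, assuming the Hugging Face transformers CLAP implementation and the public laion/clap-htsat-unfused checkpoint (the article does not specify Huolala's exact setup):

```python
# Text-side emotion embedding from a CLAP checkpoint (tooling assumed;
# not Huolala's actual implementation).
import torch
from transformers import ClapModel, ClapProcessor

name = "laion/clap-htsat-unfused"
model = ClapModel.from_pretrained(name)
processor = ClapProcessor.from_pretrained(name)

texts = ["非常抱歉给您带来不便", "Great news, your order has arrived!"]
inputs = processor(text=texts, return_tensors="pt", padding=True)
with torch.no_grad():
    emotion_emb = model.get_text_features(**inputs)  # (2, 512) projections
# The embedding would be concatenated with (or added to) the text
# encoder's features so the synthesizer can condition prosody on emotion.
print(emotion_emb.shape)
```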
Voice Customization
Custom voice creation is achieved through (a fine‑tuning sketch follows the list):
Few‑shot training: Quickly fine‑tune specific voice timbres with limited data.
Efficient transfer learning: Lightweight adjustments adapt the model to new voice styles.
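As a rough illustration of the freeze‑and‑adapt recipe, the PyTorch sketch below fine‑tunes only a speaker embedding and a small decoder head; the toy module layout is a hypothetical stand‑in, not Huolala's real architecture:

```python
# Few-shot voice adaptation: freeze the backbone, train a small subset.
import torch
import torch.nn as nn

class TinyTTS(nn.Module):
    """Toy stand-in for a pretrained TTS model (illustrative only)."""
    def __init__(self, n_speakers=8, dim=64):
        super().__init__()
        self.encoder = nn.Linear(dim, dim)    # frozen "backbone"
        self.spk_emb = nn.Embedding(n_speakers, dim)
        self.decoder = nn.Linear(dim, dim)    # lightweight adapted head

    def forward(self, x, speaker):
        return self.decoder(self.encoder(x) + self.spk_emb(speaker))

model = TinyTTS()
for p in model.parameters():
    p.requires_grad = False                   # freeze everything first
for p in model.spk_emb.parameters():          # new timbre: speaker embedding
    p.requires_grad = True
for p in model.decoder.parameters():          # plus a small decoder head
    p.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
opt = torch.optim.AdamW(trainable, lr=1e-4)

# One illustrative adaptation step on a (features, target) pair.
x, y = torch.randn(4, 64), torch.randn(4, 64)
spk = torch.zeros(4, dtype=torch.long)
loss = nn.functional.l1_loss(model(x, spk), y)
loss.backward()
opt.step()
```

Because only a small fraction of the parameters update, a handful of recordings can be enough to capture a new timbre without destabilizing the pretrained model.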
Cross‑Language Transfer Learning
To support multilingual synthesis, transfer learning shares acoustic features across languages, reducing training cost and enabling seamless language switching. A multilingual pre‑trained BERT model ensures consistent quality.
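A minimal sketch of the shared‑feature idea, assuming Hugging Face transformers and the public bert-base-multilingual-cased checkpoint (the article does not name the exact model Huolala uses):

```python
# One multilingual text encoder produces features for every language,
# so a single acoustic model can be conditioned on them.
import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
bert = AutoModel.from_pretrained(name)

sentences = ["欢迎使用货拉拉", "Welcome to Huolala"]  # mixed-language input
inputs = tokenizer(sentences, return_tensors="pt", padding=True)
with torch.no_grad():
    hidden = bert(**inputs).last_hidden_state   # (2, seq_len, 768)
print(hidden.shape)
```

Because both languages land in the same feature space, switching languages mid‑dialogue needs no per‑language acoustic retraining.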
Results
Comparisons between Huolala’s self‑developed TTS and third‑party solutions show that the former delivers more natural, human‑like speech, especially in emotional expression and real‑time scenarios.
Conclusion and Outlook
The article presents Huolala’s TTS advancements—streaming synthesis, emotion expression, multilingual support, and voice customization—aimed at delivering flexible, real‑time, and natural voice interaction. Future work will continue to innovate TTS technology to enrich Huolala’s ecosystem with smarter audio applications.
References
[1] Wu, Yusong, et al. "Large‑scale contrastive language‑audio pretraining with feature fusion and keyword‑to‑caption augmentation." ICASSP 2023.
[2] Kong, Jungil, et al. "VITS2: Improving quality and efficiency of single‑stage text‑to‑speech with adversarial learning and architecture design." arXiv preprint arXiv:2307.16430 (2023).
[3] Ren, Yi, et al. "FastSpeech 2: Fast and high‑quality end‑to‑end text to speech." arXiv preprint arXiv:2006.04558 (2020).
[4] Tan, Xu, et al. "A survey on neural speech synthesis." arXiv preprint arXiv:2106.15561 (2021).
[5] Zhang, Hui, et al. "PaddleSpeech: An easy‑to‑use all‑in‑one speech toolkit." arXiv preprint arXiv:2205.12007 (2022).
[6] Devlin, Jacob, et al. "BERT: Pre‑training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).
[7] Mohamed, Abdelrahman, et al. "Self‑supervised speech representation learning: A review." IEEE Journal of Selected Topics in Signal Processing 16.6 (2022): 1179‑1210.