How Huolala’s In‑House TTS Overcomes Latency, Naturalness, and Multilingual Limits

This article details Huolala’s self‑developed Text‑to‑Speech system, outlining its architecture, the challenges of latency, naturalness, and language support, and the innovative solutions—including streaming synthesis, emotion modeling, and transfer‑learning‑based multilingual capabilities—that deliver more flexible and realistic voice interactions.

Huolala Tech

Background

TTS (Text‑to‑Speech) converts text into spoken output and is used by Huolala primarily in intelligent customer service and phone notification scenarios. Real‑time voice feedback improves user experience, while offline synthesis enables diverse, dynamic scripts, surpassing traditional pre‑recorded approaches.

Problems

High latency makes real‑time communication difficult.

Insufficient naturalness and emotional expression results in mechanical‑sounding speech.

Limited multilingual support hampers seamless language switching.

Solutions

Latency: Develop streaming TTS for real‑time voice output.

Naturalness: Train with mixed voice tones to improve realism.

Emotion: Introduce emotion modeling for more authentic speech.

Cross‑language: Use transfer learning to share acoustic features, ensuring smooth multilingual synthesis with consistent quality.

System Framework

The TTS system is divided into four layers: infrastructure (data storage, stability), platform (speech algorithm execution), application (scenario adaptation, security), and business (AI outbound calls, intelligent customer service).

System architecture

Algorithm Overview

Mainstream TTS solutions rely on stable deep‑learning models such as Baidu PaddleSpeech, Google's Tacotron series, and Microsoft's FastSpeech series, delivering high‑quality, natural speech for content creation, education, and customer service.

Huolala’s self‑developed solution supports both streaming and non‑streaming synthesis, built on VITS2 with an optimized decoder for streaming. It also incorporates a Bert‑VITS2‑style text analysis model to enhance naturalness and expressiveness.

Mainstream TTS flow
Huolala TTS flow

Text Encoder

The text encoder converts raw text into model‑readable features. Optimizations include:

Text normalization: Clean and standardize input.

Phoneme extraction: Accurately capture phonemes in Chinese context.

Prosody optimization: Dynamically adjust intonation, pauses, and stress.

Semantic enhancement: Extract semantic features to boost expressive power.
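A toy sketch of the first two front‑end steps can make the pipeline concrete. This is a deliberately simplified illustration, not Huolala's implementation: the digit expansion, the tiny lexicon, and the phoneme symbols are all hypothetical stand‑ins for a production normalizer and grapheme‑to‑phoneme model.

```python
import re

def normalize_text(text: str) -> str:
    """Very simplified text normalization: spell out digits and collapse
    whitespace. A real front end also handles dates, currency, units, etc."""
    digit_words = {"0": "zero", "1": "one", "2": "two", "3": "three",
                   "4": "four", "5": "five", "6": "six", "7": "seven",
                   "8": "eight", "9": "nine"}
    text = re.sub(r"\d", lambda m: " " + digit_words[m.group()] + " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()

# Hypothetical lexicon; real systems combine a large pronunciation
# dictionary with a learned grapheme-to-phoneme model for unseen words.
TOY_LEXICON = {"hello": ["HH", "AH", "L", "OW"], "one": ["W", "AH", "N"]}

def to_phonemes(text: str) -> list:
    """Normalize, then map each word to its phoneme sequence."""
    phonemes = []
    for word in normalize_text(text).split():
        phonemes.extend(TOY_LEXICON.get(word, ["<UNK>"]))
    return phonemes
```

For example, `to_phonemes("Hello 1")` first normalizes the input to `"hello one"` and then emits the concatenated phoneme sequence for both words.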

Text encoder architecture

Decoder

The decoder transforms features from the text encoder into audio signals. For streaming, the VITS2 decoder processes audio features in blocks, decoding each block in real time and applying overlap‑add to ensure smooth transitions.
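The block‑wise decoding with overlap‑add described above can be sketched as follows. This is a minimal illustration of the stitching step only, assuming each decoded block overlaps its neighbor by a fixed number of samples and a linear crossfade smooths the seam; the actual VITS2 decoder and block sizes are not shown.

```python
import numpy as np

def overlap_add_stream(blocks, overlap):
    """Stitch consecutively decoded audio blocks, which share `overlap`
    samples at their boundaries, using a linear crossfade. Yields audio
    incrementally, which is what enables streaming playback."""
    fade_in = np.linspace(0.0, 1.0, overlap)
    fade_out = 1.0 - fade_in
    tail = None  # the last `overlap` samples held back from the previous block
    for block in blocks:
        block = np.asarray(block, dtype=float).copy()
        if tail is not None:
            # Crossfade the held-back tail into the head of the new block.
            block[:overlap] = tail * fade_out + block[:overlap] * fade_in
        tail = block[-overlap:].copy()
        yield block[:-overlap]  # emit everything except the tail
    if tail is not None:
        yield tail  # flush the final tail once the stream ends
```

Because each chunk is emitted as soon as its successor arrives, playback can begin after the first block rather than after the whole utterance, which is the source of the latency win.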

Decoder block processing

Emotion Modeling

Emotion features are introduced using a CLAP‑based emotion classification model that extracts emotional embeddings from text, enhancing the naturalness and expressiveness of synthesized speech.
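A rough sketch of how an utterance‑level emotion embedding can condition the synthesizer: classify the text, embed the result, and concatenate it onto every frame of the text features. The keyword classifier and emotion set here are toy stand‑ins for the CLAP‑based model described above, not Huolala's actual components.

```python
import numpy as np

# Hypothetical emotion inventory and keyword classifier, standing in for
# the CLAP-based emotion classification model.
EMOTIONS = ["neutral", "happy", "angry"]
KEYWORDS = {"thanks": "happy", "refund": "angry"}

def classify_emotion(text: str) -> str:
    for word, emotion in KEYWORDS.items():
        if word in text.lower():
            return emotion
    return "neutral"

def emotion_embedding(text: str, dim: int = 4) -> np.ndarray:
    """Deterministic per-emotion vector; a real system would use the
    classifier's continuous embedding rather than a fixed lookup."""
    rng = np.random.default_rng(EMOTIONS.index(classify_emotion(text)))
    return rng.standard_normal(dim)

def condition_features(text_features: np.ndarray, text: str) -> np.ndarray:
    """Broadcast the utterance-level emotion embedding across all frames
    and append it to each frame's feature vector."""
    emb = emotion_embedding(text)
    tiled = np.tile(emb, (text_features.shape[0], 1))
    return np.concatenate([text_features, tiled], axis=1)
```

The design choice worth noting is that the emotion signal is utterance‑level: one embedding is shared by every frame, so the decoder receives a consistent emotional context across the whole sentence.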

Emotion tag extraction

Voice Customization

Custom voice creation is achieved through:

Few‑shot training: Quickly fine‑tune specific voice timbres with limited data.

Efficient transfer learning: Light‑weight adjustments adapt the model to new voice styles.
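The essence of both points, freezing the pretrained backbone and tuning only a lightweight per‑speaker component, can be shown with a deliberately tiny model. This is an illustrative sketch under a toy linear model, not the article's actual training procedure: here the "speaker embedding" is just an additive vector adapted by gradient descent while the backbone weights stay fixed.

```python
import numpy as np

def few_shot_finetune(backbone_W, speaker_emb, X, y, lr=0.1, steps=200):
    """Adapt only the lightweight speaker embedding to a new voice while
    the pretrained backbone stays frozen. Toy model: y_hat = X @ W + emb."""
    emb = speaker_emb.copy()
    for _ in range(steps):
        pred = X @ backbone_W + emb           # frozen backbone, tunable embedding
        grad = 2.0 * (pred - y).mean(axis=0)  # gradient w.r.t. the embedding only
        emb -= lr * grad
    return emb
```

Because only the embedding's few parameters are updated, a handful of recordings suffices and the pretrained backbone's quality is preserved, which is the point of the few‑shot approach.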

Voice customization flow

Cross‑Language Transfer Learning

To support multilingual synthesis, transfer learning shares acoustic features across languages, reducing training cost and enabling seamless language switching. A multilingual pre‑trained BERT model ensures consistent quality.
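One common way to share acoustic features across languages is a shared phoneme inventory: units that sound alike in different languages map to the same symbol, so acoustic parameters learned for one language transfer to the other. The mapping below is purely illustrative, not Huolala's actual inventory.

```python
# Illustrative shared-unit table: (language, native phoneme) -> shared unit.
# Phonemes with no cross-lingual counterpart keep a language-specific tag.
SHARED_UNITS = {
    ("zh", "m"): "M",  ("en", "m"): "M",
    ("zh", "a"): "AA", ("en", "aa"): "AA",
    ("zh", "sh"): "SH", ("en", "sh"): "SH",
}

def to_shared_units(lang, phonemes):
    """Map a language's phoneme sequence into the shared acoustic space."""
    return [SHARED_UNITS.get((lang, p), f"{lang}:{p}") for p in phonemes]
```

With this mapping, Chinese `["m", "a"]` and English `["m", "aa"]` land on the same shared units, so the acoustic model treats them as one sound, which is what makes mid‑sentence language switching smooth.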

Multilingual implementation flow

Results

Comparisons between Huolala’s self‑developed TTS and third‑party solutions show that the former delivers more natural, human‑like speech, especially in emotion and real‑time scenarios.

Conclusion and Outlook

The article presents Huolala’s TTS advancements—streaming synthesis, emotion expression, multilingual support, and voice customization—aimed at delivering flexible, real‑time, and natural voice interaction. Future work will continue to innovate TTS technology to enrich Huolala’s ecosystem with smarter audio applications.

References

[1] Wu, Yusong, et al. "Large‑scale contrastive language‑audio pretraining with feature fusion and keyword‑to‑caption augmentation." ICASSP 2023.

[2] Kong, Jungil, et al. "VITS2: Improving quality and efficiency of single‑stage text‑to‑speech with adversarial learning and architecture design." arXiv preprint arXiv:2307.16430 (2023).

[3] Ren, Yi, et al. "FastSpeech 2: Fast and high‑quality end‑to‑end text to speech." arXiv preprint arXiv:2006.04558 (2020).

[4] Tan, Xu, et al. "A survey on neural speech synthesis." arXiv preprint arXiv:2106.15561 (2021).

[5] Zhang, Hui, et al. "PaddleSpeech: An easy‑to‑use all‑in‑one speech toolkit." arXiv preprint arXiv:2205.12007 (2022).

[6] Devlin, Jacob, et al. "BERT: Pre‑training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).

[7] Mohamed, Abdelrahman, et al. "Self‑supervised speech representation learning: A review." IEEE Journal of Selected Topics in Signal Processing 16.6 (2022): 1179‑1210.
