Real‑Time Voice Dialogue: Practices, Challenges, and Duplex Conversation

This article presents an in‑depth overview of Alibaba's real‑time voice dialogue system, covering the Hotline XiaoMi robot, the unique challenges of spoken interactions such as colloquialism, multimodality and duplex communication, and the research advances in ASR‑robust SLU, emotion detection, colloquial processing, and duplex conversation modeling.

DataFunSummit

Introduction – The talk focuses on practical experiences with intelligent voice dialogue in real‑time telephone scenarios, where users interact with a robot called Hotline XiaoMi.

1. Voice Dialogue Robot: Hotline XiaoMi – Hotline XiaoMi handles both inbound calls (user‑initiated) and outbound calls (platform‑initiated) using speech‑based interaction, providing a multi‑turn, voice‑driven customer‑service experience.

Figure: Hotline XiaoMi illustration

2. Challenges of Real‑Time Voice Dialogue

Colloquialism – users speak in long, unstructured sentences, and the ASR transcripts of that speech add recognition noise.

Multimodality – speech carries rich acoustic cues (tone, emotion, background) beyond text.

Duplex communication – a phone conversation demands low latency and tight interaction, so the system has to manage micro‑turns rather than whole turns.

Figure: Voice dialogue challenges

3. From Text‑Driven to Voice‑Semantic Dialogue

3.1 ASR‑Robust SLU – A pre‑trained model encodes both pronunciation and semantics, enabling tolerance to common ASR errors (homophones, similar sounds, pinyin truncation, digit‑letter conversion). The model can be fine‑tuned on downstream tasks such as intent classification or order matching.

Figure: ASR‑robust SLU comparison
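
The idea of scoring an utterance on both its spelling and its sound can be illustrated with a toy matcher. This is a minimal sketch, not the pre‑trained model described in the talk: the `PHONETIC_KEYS` table is a hand‑written, hypothetical lexicon, English homophones stand in for the pinyin cases, and the equal weighting of the two similarities is an arbitrary assumption.

```python
from difflib import SequenceMatcher

# Hypothetical pronunciation lexicon: words that sound alike share a key.
PHONETIC_KEYS = {
    "for": "F AO R", "four": "F AO R",
    "to": "T UW", "two": "T UW", "too": "T UW",
    "check": "CH EH K", "cheque": "CH EH K",
    "order": "AO R D ER", "refund": "R IY F AH N D",
    "my": "M AY", "status": "S T AE T AH S",
}


def phonetic_form(text: str) -> str:
    """Map each word to its phonetic key; unknown words keep their spelling."""
    return " | ".join(PHONETIC_KEYS.get(w, w) for w in text.lower().split())


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def robust_score(asr_text: str, reference: str, phonetic_weight: float = 0.5) -> float:
    """Blend spelling similarity with pronunciation similarity."""
    text_sim = similarity(asr_text.lower(), reference.lower())
    sound_sim = similarity(phonetic_form(asr_text), phonetic_form(reference))
    return (1 - phonetic_weight) * text_sim + phonetic_weight * sound_sim


def classify_intent(asr_text: str, intents: dict) -> str:
    """Return the intent whose example phrase best matches the ASR output."""
    return max(intents, key=lambda name: robust_score(asr_text, intents[name]))


if __name__ == "__main__":
    intents = {"order_status": "check my order status", "refund": "refund my order"}
    # "cheque" is a homophone error for "check"; the phonetic forms are identical.
    print(classify_intent("cheque my order status", intents))
```

Because the homophone maps to the same phonetic key as the intended word, the pronunciation term stays at its maximum even though the spelling differs, which is the kind of tolerance the section describes.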

3.2 Voice Emotion Detection – Multimodal training combines audio and text to recognize negative emotions in customer calls, addressing the scarcity and low quality of existing academic speech‑emotion datasets.

Figure: Emotion detection pipeline
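
The talk does not say how the audio and text signals are combined. As one possible illustration, the sketch below uses simple late fusion: two separately produced scores are averaged and thresholded. The function name, the logit inputs, the weight, and the threshold are all assumptions made for this example, and late fusion is a stand‑in technique rather than the multimodal training the section mentions.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def fuse_negative_emotion(audio_logit: float, text_logit: float,
                          audio_weight: float = 0.5, threshold: float = 0.5):
    """Late fusion: weighted average of two logits, then a threshold.

    audio_logit and text_logit are assumed to come from two hypothetical
    upstream classifiers (acoustic and transcript based).
    Returns (probability of negative emotion, flag).
    """
    fused = audio_weight * audio_logit + (1 - audio_weight) * text_logit
    prob = sigmoid(fused)
    return prob, prob >= threshold


if __name__ == "__main__":
    # Polite wording (text looks neutral) but an angry tone (audio looks negative).
    print(fuse_negative_emotion(audio_logit=2.0, text_logit=-0.5))
```

The example call shows why the acoustic channel matters: a transcript that reads as neutral can still be flagged when the tone of voice is strongly negative.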

3.3 Colloquial Expression Handling – Two strategies are used: (1) short‑sentence classification + pattern reasoning, and (2) BERT‑based summarization to compress long, redundant utterances.

Figure: Colloquial processing
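
Strategy (1) can be sketched as follows: split a long utterance into short clauses, label each clause, then reason over the set of labels with patterns. This is only an illustration under stated assumptions. The keyword cues stand in for a trained short‑sentence classifier, and the labels, patterns, and intent names are hypothetical.

```python
import re

# Hypothetical clause labels with keyword cues (stand-in for a trained classifier).
CLAUSE_CUES = {
    "ordered": ("bought", "ordered", "purchased"),
    "not_received": ("not arrived", "hasn't arrived", "never came", "not received"),
    "want_refund": ("refund", "money back"),
}

# Hypothetical patterns, most specific first: if every label in the set
# was observed somewhere in the utterance, infer the intent on the right.
INTENT_PATTERNS = [
    ({"not_received", "want_refund"}, "refund_for_missing_parcel"),
    ({"not_received"}, "track_parcel"),
    ({"want_refund"}, "refund_request"),
]


def split_clauses(utterance: str) -> list:
    """Cut on punctuation and on a few connective words."""
    parts = re.split(r"[,.;?!]|\b(?:and|but|so)\b", utterance.lower())
    return [p.strip() for p in parts if p.strip()]


def classify_clause(clause: str) -> str:
    for label, cues in CLAUSE_CUES.items():
        if any(cue in clause for cue in cues):
            return label
    return "filler"  # hesitations, "you know", and similar padding


def infer_intent(utterance: str) -> str:
    labels = {classify_clause(c) for c in split_clauses(utterance)}
    for required, intent in INTENT_PATTERNS:
        if required <= labels:
            return intent
    return "unknown"


if __name__ == "__main__":
    text = ("um so I ordered a phone last week, and it still hasn't arrived, "
            "you know, so I want my money back")
    print(split_clauses(text))
    print(infer_intent(text))
```

Working on clauses keeps each classification decision short and simple, while the pattern step recombines the pieces so that filler clauses do not affect the final intent.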

4. Duplex Conversation

4.1 Definition – Duplex dialogue is characterized by exclusive, continuous, non‑atomic interaction where both parties may speak and listen simultaneously, requiring prediction of turn‑taking.

Figure: Duplex vs. synchronous vs. asynchronous

4.2 Duplex DM (Decision‑Making) – A micro‑turn detector feeds state and event information to a Duplex‑DM module, which decides actions such as waiting, invoking NLU, or producing task‑free chat responses.

Figure: Duplex DM architecture
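
The decision step described in 4.2 can be pictured as a function from one micro‑turn event to an action. The sketch below is a rule‑based illustration only: the `MicroTurn` fields, the backchannel word list, and the 300 ms silence threshold are assumptions made for this example, and the real module may well be a learned model.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Action(Enum):
    WAIT = auto()           # keep listening (or keep speaking)
    INVOKE_NLU = auto()     # send the utterance to language understanding
    CHAT_RESPONSE = auto()  # short non-task reply, e.g. an acknowledgement


@dataclass
class MicroTurn:
    """One event from a hypothetical micro-turn detector."""
    partial_text: str     # ASR text heard so far in this micro-turn
    silence_ms: int       # silence since the user's last word
    robot_speaking: bool  # state: is the robot currently talking?


BACKCHANNELS = {"uh-huh", "ok", "okay", "yeah", "mm", "right"}  # assumed list
END_OF_TURN_SILENCE_MS = 300                                     # assumed threshold


def decide(event: MicroTurn) -> Action:
    text = event.partial_text.strip().lower()
    if not text:
        return Action.WAIT
    if text in BACKCHANNELS:
        # An acknowledgement is not an interruption: the robot keeps talking.
        # If the robot is silent, answer with a short chat response instead.
        return Action.WAIT if event.robot_speaking else Action.CHAT_RESPONSE
    if event.robot_speaking:
        # Content-bearing speech over the robot is treated as a real interruption.
        return Action.INVOKE_NLU
    if event.silence_ms >= END_OF_TURN_SILENCE_MS:
        return Action.INVOKE_NLU  # the user appears to have finished
    return Action.WAIT            # the user is probably still mid-sentence


if __name__ == "__main__":
    print(decide(MicroTurn("okay", 100, robot_speaking=True)))
    print(decide(MicroTurn("I want to cancel my order", 500, robot_speaking=False)))
```

Keeping the decision as a pure function of state and event makes it easy to replay recorded events against it, which is also how a simulator could exercise such a module offline.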

4.3 Benefits

Shorter response latency – reduces reply time from ~1 s to < 500 ms by “listen‑while‑thinking” and “think‑while‑speaking”.

Semantic interruption – judges from the content of a user’s overlapping speech whether the user actually intends to interrupt the robot’s current turn.

Interactive digit collection – handles high‑frequency, error‑prone numeric inputs (phone numbers, IDs) with higher completion rates.

Simulation environment – a command‑level simulator reproduces duplex timing and interaction for offline training.

Figure: Response latency demo
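
The interactive digit collection benefit listed above can be sketched as a small collector that accepts a number in chunks, echoes each chunk back, and lets the user reject the last one. The class name, the replies, and the correction words are hypothetical, and the sketch assumes the ASR already outputs digits as numerals.

```python
import re


class DigitCollector:
    """Collect a fixed-length number over several user micro-turns."""

    def __init__(self, expected_length: int):
        self.expected_length = expected_length
        self.chunks = []  # digit chunks accepted so far

    @property
    def digits(self) -> str:
        return "".join(self.chunks)

    def hear(self, utterance: str) -> str:
        """Process one user micro-turn and return the robot's short reply."""
        text = utterance.lower()
        if self.chunks and re.search(r"\b(no|wrong)\b", text):
            self.chunks.pop()  # the user rejected the last echoed chunk
        new_digits = "".join(re.findall(r"\d", text))
        if new_digits:
            self.chunks.append(new_digits)
        if len(self.digits) == self.expected_length:
            return f"Got it: {self.digits}."
        if len(self.digits) > self.expected_length:
            self.chunks.pop()  # discard the chunk that overflowed
            return "That is too many digits, please repeat the last part."
        return f"{new_digits}, go on." if new_digits else "Please continue."


if __name__ == "__main__":
    collector = DigitCollector(expected_length=11)
    for said in ["138", "1234", "no, 1235", "6789"]:
        print(said, "->", collector.hear(said))
```

Echoing each chunk gives the user an immediate chance to correct a misrecognition, so one bad chunk does not force the whole number to be repeated.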

Conclusion – The work progresses from a traditional text‑driven dialogue system to a voice‑semantic, duplex‑enabled conversation platform, integrating ASR‑robust SLU, multimodal emotion detection, colloquial processing, and a dedicated duplex decision module.
