
Real‑Time Voice Dialogue: Practices, Challenges, and Duplex Conversation

This article presents an in‑depth overview of Alibaba's real‑time voice dialogue system, covering the Hotline XiaoMi robot, the unique challenges of spoken interactions such as colloquialism, multimodality and duplex communication, and the research advances in ASR‑robust SLU, emotion detection, colloquial processing, and duplex conversation modeling.

DataFunSummit

Introduction – The talk focuses on practical experiences with intelligent voice dialogue in real‑time telephone scenarios, where users interact with a robot called Hotline XiaoMi.

1. Voice Dialogue Robot: Hotline XiaoMi – Hotline XiaoMi handles both inbound calls (user‑initiated) and outbound calls (platform‑initiated) using speech‑based interaction, providing a multi‑turn, voice‑driven customer‑service experience.

2. Challenges of Real‑Time Voice Dialogue

Colloquialism – users speak in long, unstructured sentences, and the transcripts arrive with ASR noise.

Multimodality – speech carries rich acoustic cues (tone, emotion, background) beyond text.

Duplex communication – the dialogue must run at low latency with tight, overlapping interaction, which is modeled as micro‑turns.

3. From Text‑Driven to Voice‑Semantic Dialogue

3.1 ASR‑Robust SLU – A pre‑trained model encodes both pronunciation and semantics, enabling tolerance to common ASR errors (homophones, similar sounds, pinyin truncation, digit‑letter conversion). The model can be fine‑tuned on downstream tasks such as intent classification or order matching.
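The intuition behind pronunciation‑aware robustness can be sketched without the pretrained model itself: comparing utterances in pinyin space makes homophone ASR substitutions harmless. The tiny pinyin table below is a hypothetical stand‑in for a real lexicon, and the exact‑match rule stands in for the learned pronunciation embedding:

```python
# Minimal sketch: ASR often substitutes homophones, so matching in
# pronunciation (pinyin) space tolerates errors that character-level
# matching would reject. The table is an illustrative fragment only.

PINYIN = {
    "订": "ding", "定": "ding",   # homophones: "order" vs. "fix/set"
    "单": "dan",
    "换": "huan",
    "货": "huo",
}

def to_pinyin(text):
    """Map each character to its pinyin; unknown characters pass through."""
    return " ".join(PINYIN.get(ch, ch) for ch in text)

def pronunciation_match(asr_text, reference):
    """True when two strings sound alike even if the characters differ."""
    return to_pinyin(asr_text) == to_pinyin(reference)
```

For example, `pronunciation_match("定单", "订单")` is true: the ASR homophone error `定单` still matches the intended `订单` ("order") once both are projected to pinyin.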

3.2 Voice Emotion Detection – Multimodal training combines audio and text to recognize negative emotions in customer calls, addressing the scarcity and low quality of existing academic speech‑emotion datasets.
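A minimal sketch of the multimodal idea, using late fusion of per‑modality scores. The weights, threshold, and function name are illustrative assumptions, not the system's actual architecture:

```python
def fuse_negative_emotion(audio_prob, text_prob, w_audio=0.6, threshold=0.5):
    """Late fusion of per-modality negative-emotion probabilities.

    The audio weight is larger because tone often carries anger or
    frustration that the ASR transcript loses; both the weight and the
    decision threshold are illustrative, not tuned values.
    """
    score = w_audio * audio_prob + (1.0 - w_audio) * text_prob
    return score >= threshold
```

With these placeholder weights, an utterance whose audio sounds strongly negative (e.g. `audio_prob=0.9`) is flagged even when the transcript alone looks neutral (`text_prob=0.2`).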

3.3 Colloquial Expression Handling – Two strategies are used: (1) short‑sentence classification + pattern reasoning, and (2) BERT‑based summarization to compress long, redundant utterances.
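Strategy (1) can be sketched as a pipeline: split the long utterance into short clauses, classify each clause, then aggregate the distinct intents. The keyword rules below are hypothetical stand‑ins for the short‑sentence classifier:

```python
import re

# Hypothetical keyword rules standing in for a trained clause classifier.
INTENT_RULES = {
    "refund": ["refund", "money back"],
    "logistics": ["delivery", "shipped", "arrive"],
}

def split_clauses(utterance):
    """Break a long colloquial utterance into short clauses on punctuation."""
    return [c.strip() for c in re.split(r"[,.;!?，。；！？]+", utterance) if c.strip()]

def classify_clause(clause):
    """Return the first matching intent label, or None for filler clauses."""
    low = clause.lower()
    for intent, keywords in INTENT_RULES.items():
        if any(k in low for k in keywords):
            return intent
    return None

def aggregate_intents(utterance):
    """Classify clause by clause, keeping distinct intents in spoken order."""
    seen = []
    for clause in split_clauses(utterance):
        intent = classify_clause(clause)
        if intent and intent not in seen:
            seen.append(intent)
    return seen
```

A rambling utterance like "I ordered last week, it still hasn't arrived, I want a refund" thus reduces to the ordered intent list `["logistics", "refund"]`, with the filler clause contributing nothing.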

4. Duplex Conversation

4.1 Definition – Duplex dialogue is characterized by non‑exclusive, continuous, non‑atomic interaction: both parties may speak and listen simultaneously, so the system must predict turn‑taking rather than wait for fixed turns.

4.2 Duplex DM (Decision‑Making) – A micro‑turn detector feeds state and event information to a Duplex‑DM module, which decides actions such as waiting, invoking NLU, or producing task‑free chat responses.
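The Duplex‑DM decision can be sketched as a small policy over micro‑turn state. The thresholds, event flags, and action set below are illustrative assumptions, not the production policy:

```python
from enum import Enum

class Action(Enum):
    WAIT = "wait"              # keep listening; the user has not finished
    INVOKE_NLU = "invoke_nlu"  # utterance looks complete; run understanding
    CHAT = "chat"              # light, task-free acknowledgement

def duplex_dm(user_speaking, silence_ms, is_backchannel):
    """Pick the next micro-turn action from simple dialogue state.

    Rules (illustrative): never cut in while the user is speaking; answer
    pure backchannels with task-free chat; treat short pauses as the user
    still formulating; only take the turn after a longer silence.
    """
    if user_speaking:
        return Action.WAIT
    if is_backchannel:          # e.g. "uh-huh": no task content to parse
        return Action.CHAT
    if silence_ms < 300:        # short pause: the user may continue
        return Action.WAIT
    return Action.INVOKE_NLU    # silence long enough: take the turn
```

Running this policy every micro‑turn is what lets the system "listen while thinking": NLU is invoked the moment the pause crosses the threshold, instead of after a full fixed turn.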

4.3 Benefits

Shorter response latency – reduces reply time from ~1 s to < 500 ms by “listen‑while‑thinking” and “think‑while‑speaking”.

Semantic interruption – decides whether a user’s utterance truly intends to interrupt the current turn.

Interactive digit collection – handles high‑frequency, error‑prone numeric inputs (phone numbers, IDs) with higher completion rates.

Simulation environment – a command‑level simulator reproduces duplex timing and interaction for offline training.
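The interactive digit‑collection idea above, accumulating digits across micro‑turns until a complete number has been heard, can be sketched as follows. The target length of 11 assumes a mainland‑China mobile number and is illustrative:

```python
def collect_digits(chunks, target_len=11):
    """Accumulate digits across streaming micro-turn transcripts.

    Users read numbers in bursts ("138... 0013... 8000"), so each chunk
    may carry only part of the number. Returns the number once target_len
    digits are heard; returns None if the stream ends early, in which case
    the dialogue manager would re-prompt for the missing digits.
    """
    buf = ""
    for chunk in chunks:
        buf += "".join(c for c in chunk if c.isdigit())
        if len(buf) >= target_len:
            return buf[:target_len]
    return None
```

Collecting incrementally per micro‑turn, rather than forcing the user to say the whole number in one turn, is what raises completion rates on these error‑prone inputs.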

Conclusion – The work progresses from a traditional text‑driven dialogue system to a voice‑semantic, duplex‑enabled conversation platform, integrating ASR‑robust SLU, multimodal emotion detection, colloquial processing, and a dedicated duplex decision module.

Tags: Multimodal, Speech AI, ASR, voice dialogue, duplex conversation, SLU
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
