Real‑Time Voice Dialogue: Practices, Challenges, and Duplex Conversation
This article presents a comprehensive overview of real‑time voice dialogue systems, covering the hotline robot architecture, unique challenges of spoken interactions, ASR‑robust SLU models, multimodal emotion detection, oral expression handling, and the design and benefits of duplex (full‑duplex) conversational frameworks.
The session, presented by Alibaba algorithm expert Chen Kehan, introduces the real‑time voice dialogue scenario, focusing on telephone‑based interactions and the "Hotline XiaoMi" robot that handles both inbound and outbound calls.
1. Voice Dialogue Robot – Hotline XiaoMi
Hotline XiaoMi is Alibaba's intelligent telephone customer‑service robot. It works in two modes: answering user‑initiated (inbound) calls and proactively reaching users through outbound calls. It relies on speech‑driven dialogue to provide seamless customer service.
2. Challenges of Real‑Time Voice Dialogue
Colloquial expressions: users speak in long, discontinuous, and noisy (ASR‑error) utterances.
Multimodality: speech carries richer information (tone, emotion) than text.
Full‑duplex interaction: low latency and strong turn‑taking requirements.
These challenges differentiate voice dialogue from traditional text‑based IM systems.
3. From Text‑Driven to Speech‑Semantic‑Driven Dialogue
Traditional five‑stage dialogue pipelines (ASR → NLU → DM → NLG → TTS) are insufficient for spoken input because the text‑driven middle stages (NLU, DM, NLG) ignore acoustic cues and ASR errors.
3.1 ASR‑Robust SLU
Four common ASR error types (homophones, similar pronunciations, pinyin truncation/concatenation, digit‑letter conversion) affect downstream tasks. Instead of a costly "correct‑then‑SLU" pipeline, a fault‑tolerant SLU model directly maps noisy ASR output to correct intents, using a pre‑trained model that jointly encodes pronunciation and semantics.
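The idea of scoring noisy ASR output on pronunciation as well as on surface text can be illustrated with a toy matcher. This is only a sketch under stated assumptions: the talk describes a pre‑trained model, whereas the code below swaps in simple string similarity from Python's standard library. The intent names, the English sound classes, and the 0.7 threshold are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical intent inventory: one canonical phrase per intent.
INTENTS = {
    "check_balance": "check my balance",
    "cancel_order": "cancel my order",
    "reset_password": "reset my password",
}

# Crude, assumed sound classes: consonants that ASR often confuses share a symbol.
SOUND_CLASSES = {
    **dict.fromkeys("bp", "p"),
    **dict.fromkeys("dt", "t"),
    **dict.fromkeys("cgkq", "k"),
    **dict.fromkeys("sxz", "s"),
    **dict.fromkeys("fv", "f"),
    **dict.fromkeys("mn", "n"),
    "l": "l", "r": "r", "j": "j",
}

def phonetic_key(text):
    """Map each word to a rough pronunciation key (vowels dropped, repeats collapsed)."""
    words = []
    for word in text.lower().split():
        symbols = []
        for ch in word:
            symbol = SOUND_CLASSES.get(ch)  # vowels, h, w, y, punctuation -> None
            if symbol and (not symbols or symbols[-1] != symbol):
                symbols.append(symbol)
        words.append("".join(symbols))
    return " ".join(words)

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

def robust_intent(asr_text, threshold=0.7):
    """Return the best intent using both text and pronunciation similarity, or None."""
    best_intent, best_score = None, 0.0
    for intent, phrase in INTENTS.items():
        text_score = similarity(asr_text.lower(), phrase)
        sound_score = similarity(phonetic_key(asr_text), phonetic_key(phrase))
        score = 0.5 * text_score + 0.5 * sound_score
        if score > best_score:
            best_intent, best_score = intent, score
    return best_intent if best_score >= threshold else None

print(robust_intent("czech my balance"))   # homophone-style ASR error
print(robust_intent("cancel my odor"))     # similar-pronunciation ASR error
```

No correction step runs before classification: the misrecognized text is mapped to an intent directly, which is the property the fault‑tolerant SLU approach aims for.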
3.2 Speech Emotion Detection
Detecting user emotion from audio is crucial for customer‑service scenarios. Existing academic datasets are often acted out and do not reflect real‑world call‑center data, so a regression‑based labeling scheme (positive/negative intensity) is adopted to improve data quality and enable multimodal training.
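As a rough illustration of regression‑style labels, the sketch below turns several ratings of one call segment into a continuous (positive, negative) intensity target. The talk does not describe the annotation procedure, so the multi‑annotator setup, the [0, 1] intensity range, and the function name are assumptions made for this example.

```python
from statistics import mean

def regression_label(ratings):
    """Collapse per-annotator (positive, negative) intensity ratings into one target pair.

    `ratings` is a list of (positive_intensity, negative_intensity) tuples,
    each assumed to lie in [0, 1].
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    for pos, neg in ratings:
        if not (0.0 <= pos <= 1.0 and 0.0 <= neg <= 1.0):
            raise ValueError("intensities are assumed to lie in [0, 1]")
    return mean(p for p, _ in ratings), mean(n for _, n in ratings)

# Two annotators rate the same segment as mildly to strongly negative.
positive, negative = regression_label([(0.0, 0.8), (0.2, 0.6)])
print(round(positive, 2), round(negative, 2))
```

Continuous targets let disagreement between annotators show up as an intermediate intensity instead of forcing a single hard class.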
3.3 Oral‑Expression Handling
Oral expressions often contain redundant information. Two preprocessing strategies are used: (1) short‑sentence classification + pattern reasoning, and (2) BERT‑based summarization to extract concise intents.
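The first strategy, short‑sentence classification followed by pattern reasoning, can be sketched as follows. The keyword rules stand in for a trained short‑sentence classifier, and the labels, patterns, and intent names are hypothetical.

```python
import re

# Hypothetical keyword rules standing in for a trained short-sentence classifier.
RULES = [
    ("refund_request", ("refund", "money back")),
    ("delivery_problem", ("not arrived", "never came", "late")),
    ("filler", ("um", "hello", "you know")),
]

# Hypothetical patterns over the sequence of clause labels.
PATTERNS = [
    (("delivery_problem", "refund_request"), "refund_for_undelivered_order"),
]

def split_clauses(utterance):
    """Cut a long spoken utterance into short clauses at punctuation."""
    return [c.strip() for c in re.split(r"[,.;?!]+", utterance) if c.strip()]

def classify_clause(clause):
    lowered = clause.lower()
    for label, keywords in RULES:
        if any(re.search(r"\b" + re.escape(k) + r"\b", lowered) for k in keywords):
            return label
    return "other"

def is_subsequence(pattern, labels):
    remaining = iter(labels)
    return all(label in remaining for label in pattern)

def infer_intent(utterance):
    """Classify each clause, drop uninformative ones, then reason over the label sequence."""
    labels = [classify_clause(c) for c in split_clauses(utterance)]
    informative = [l for l in labels if l not in ("filler", "other")]
    for pattern, intent in PATTERNS:
        if is_subsequence(pattern, informative):
            return intent
    return informative[0] if informative else "unknown"

print(infer_intent(
    "Um, hello, so my package has not arrived, it's been two weeks, I want a refund."
))
```

Redundant clauses are discarded before any intent is inferred, so the length of the utterance matters less than the few clauses that carry information.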
4. Duplex (Full‑Duplex) Dialogue
Duplex dialogue is defined by three properties: exclusive (single‑channel call), continuous (non‑atomic turn‑taking), and non‑perfect‑information (uncertain when the other side finishes speaking). The goal is to minimize both the time when neither party is speaking (mutual silence) and the time when both parties are speaking at once (overlapping speech).
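The two quantities a duplex system tries to keep small, mutual silence and overlapping speech, can be measured from speech intervals on each side of the call. The sketch below assumes each side's speech is given as (start, end) pairs in milliseconds; the function names and the sample timeline are illustrative only.

```python
def total_overlap(user, agent):
    """Total time during which both sides speak at once."""
    return sum(
        max(0, min(u_end, a_end) - max(u_start, a_start))
        for u_start, u_end in user
        for a_start, a_end in agent
    )

def mutual_silence(user, agent, call_end):
    """Total time during which neither side speaks, from time 0 to call_end."""
    silence, cursor = 0, 0
    for start, end in sorted(user + agent):
        if start > cursor:          # gap before the next speech interval
            silence += start - cursor
        cursor = max(cursor, end)
    return silence + max(0, call_end - cursor)

# Hypothetical timeline in milliseconds.
user = [(0, 2000), (5000, 6000)]
agent = [(2500, 5200)]

print(total_overlap(user, agent))          # 200 ms of simultaneous speech
print(mutual_silence(user, agent, 7000))   # 1500 ms of simultaneous silence
```

In this timeline the agent waits 500 ms before answering and the user barges in 200 ms before the agent finishes, which is the kind of trade‑off a duplex policy has to balance.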
4.1 Representation
A domain‑specific language (DSL) encodes dialogue as state → event → action. This structured representation enables training data generation and visualization of complex turn‑taking patterns.
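A state → event → action representation can be sketched as a transition table. The talk does not show the actual DSL, so every state, event, and action name below is a hypothetical placeholder chosen for illustration.

```python
# (state, event) -> (action, next_state); all names are hypothetical.
TRANSITIONS = {
    ("idle", "user_starts_speaking"): ("start_listening", "listening"),
    ("listening", "user_pauses"): ("start_thinking", "thinking"),
    ("thinking", "response_ready"): ("start_speaking", "speaking"),
    ("thinking", "user_starts_speaking"): ("discard_draft", "listening"),
    ("speaking", "user_starts_speaking"): ("stop_speaking", "listening"),
    ("speaking", "playback_finished"): ("wait", "idle"),
}

def run(events, state="idle"):
    """Replay an event sequence and return the (state, event, action) trace and final state."""
    trace = []
    for event in events:
        # Events with no matching rule are ignored and leave the state unchanged.
        action, next_state = TRANSITIONS.get((state, event), ("ignore", state))
        trace.append((state, event, action))
        state = next_state
    return trace, state

trace, final_state = run([
    "user_starts_speaking",
    "user_pauses",
    "response_ready",
    "user_starts_speaking",   # the user barges in while the agent is speaking
])
for step in trace:
    print(" -> ".join(step))
print("final state:", final_state)
```

Because each step is a plain (state, event, action) triple, the same trace can serve as a training example or be drawn as a timeline.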
4.2 Capabilities
Shorter response latency: By "listening‑while‑thinking" and "thinking‑while‑speaking", response time drops to ~500‑700 ms, close to human turn‑taking.
Semantic interruption: The system decides whether a user’s interruption is relevant to the current task, avoiding naïve audio‑based cuts.
Interactive digit collection: Specialized handling of high‑precision numeric inputs (phone numbers, IDs) improves collection completeness compared with traditional keypad methods.
Simulation environment: An instruction‑level simulator reproduces duplex timing and interaction patterns, facilitating offline training of end‑to‑end models.
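Of the capabilities above, semantic interruption lends itself to a compact sketch: when the user speaks over the agent, the agent stops only if what was said is relevant. The keyword sets below stand in for a semantic model, and the task names and vocabularies are hypothetical.

```python
import re

# Short acknowledgements that should not interrupt the agent.
BACKCHANNELS = {"uh huh", "ok", "okay", "yeah", "right", "mm", "i see"}

# Hypothetical task vocabularies; a production system would use a semantic model.
TASK_KEYWORDS = {
    "collect_address": {"address", "street", "city", "zip"},
    "confirm_refund": {"refund", "amount", "card", "account"},
}
GLOBAL_KEYWORDS = {"wait", "stop", "agent", "human", "repeat"}

def should_yield(task, overlap_text):
    """Decide whether the agent should stop speaking for this overlapping utterance."""
    normalized = re.sub(r"[^a-z\s]", " ", overlap_text.lower())  # "Uh-huh" -> "uh huh"
    normalized = " ".join(normalized.split())
    if not normalized or normalized in BACKCHANNELS:
        return False
    words = set(normalized.split())
    relevant = TASK_KEYWORDS.get(task, set()) | GLOBAL_KEYWORDS
    return bool(words & relevant)

print(should_yield("confirm_refund", "Uh-huh"))                       # acknowledgement
print(should_yield("confirm_refund", "wait, that's the wrong card"))  # relevant objection
print(should_yield("confirm_refund", "the TV is loud today"))         # unrelated speech
```

A purely audio‑based cut would stop the agent in all three cases; here only the second utterance interrupts playback.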
5. Summary
The talk highlighted three distinctive aspects of real‑time voice dialogue—colloquialism, multimodality, and duplexing—and presented concrete solutions: fault‑tolerant SLU, multimodal emotion detection, oral‑expression processing, and a full‑duplex conversational architecture that reduces latency, supports semantic interruptions, and enables robust digit collection.
For a live demo, callers can dial Alibaba’s official hotline.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.