Real‑Time Voice Dialogue: Practices, Challenges, and Duplex Conversation

This article presents a comprehensive overview of real‑time voice dialogue systems, covering the hotline robot architecture, unique challenges of spoken interactions, ASR‑robust SLU models, multimodal emotion detection, oral expression handling, and the design and benefits of duplex (full‑duplex) conversational frameworks.

DataFunTalk

Dec 5, 2021

The session, presented by Alibaba algorithm expert Chen Kehan, introduces the real‑time voice dialogue scenario, focusing on telephone‑based interactions and the "Hotline XiaoMi" robot that handles both inbound and outbound calls.

1. Voice Dialogue Robot – Hotline XiaoMi

Hotline XiaoMi is Alibaba's intelligent telephone customer‑service robot. It works in two modes: answering calls that users initiate (inbound) and proactively calling users (outbound). In both modes it relies on speech‑driven dialogue to provide customer service.

2. Challenges of Real‑Time Voice Dialogue

Colloquial expressions: user utterances are long and discontinuous, and they reach the system with ASR errors.

Multimodality: speech carries richer information (tone, emotion) than text.

Full‑duplex interaction: low latency and strong turn‑taking requirements.

These challenges differentiate voice dialogue from traditional text‑based IM systems.

3. From Text‑Driven to Speech‑Semantic‑Driven Dialogue

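Traditional five‑stage dialogue pipelines (ASR → NLU → DM → NLG → TTS) are text‑driven: every stage after ASR sees only the transcript, so acoustic cues are discarded and ASR errors propagate downstream. This makes them insufficient for spoken input.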

3.1 ASR‑Robust SLU

Four common ASR error types (homophones, similar pronunciations, pinyin truncation/concatenation, digit‑letter conversion) affect downstream tasks. Instead of a costly "correct‑then‑SLU" pipeline, a fault‑tolerant SLU model directly maps noisy ASR output to correct intents, using a pre‑trained model that jointly encodes pronunciation and semantics.
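To illustrate why a pronunciation channel makes intent recognition tolerant of homophone and similar‑pronunciation errors, here is a minimal sketch. It is not the pre‑trained model from the talk: a nearest‑example lookup stands in for the model, and the `PINYIN` table, the `INTENT_EXAMPLES` inventory, and the `match_intent` function are all assumptions made up for this example.

```python
# Toy illustration only: a hand-written pinyin table stands in for a real
# grapheme-to-phoneme component, and nearest-example lookup stands in for the
# pre-trained model described in the talk.
from difflib import SequenceMatcher

# Hypothetical character -> toneless pinyin table covering just this example.
PINYIN = {
    "退": "tui", "款": "kuan", "腿": "tui", "宽": "kuan",
    "查": "cha", "物": "wu", "流": "liu",
}

# Hypothetical intent inventory: one reference utterance per intent.
INTENT_EXAMPLES = {"refund": "退款", "track_shipment": "查物流"}


def to_pinyin(text):
    """Map each character to its pinyin; unknown characters map to themselves."""
    return [PINYIN.get(ch, ch) for ch in text]


def similarity(a, b):
    """Similarity in [0, 1] between two sequences (strings or lists)."""
    return SequenceMatcher(None, a, b).ratio()


def match_intent(asr_text, phonetic_weight=0.5):
    """Return (intent, score), scoring each reference by characters and pinyin."""
    best = None
    for intent, ref in INTENT_EXAMPLES.items():
        char_score = similarity(asr_text, ref)
        pron_score = similarity(to_pinyin(asr_text), to_pinyin(ref))
        score = (1 - phonetic_weight) * char_score + phonetic_weight * pron_score
        if best is None or score > best[1]:
            best = (intent, score)
    return best
```

In this sketch the misrecognized input "腿宽" shares no characters with the reference "退款", so a text‑only score is zero, but the two strings have the same toneless pinyin, so the combined score still selects `refund` without a separate correction step.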

3.2 Speech Emotion Detection

Detecting user emotion from audio is crucial in customer‑service scenarios. Existing academic datasets are often acted and do not reflect real call‑center speech, so a regression‑based labeling scheme (positive and negative intensity) is adopted to improve data quality and enable multimodal training.
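One way to picture regression‑style labels is to turn several annotators' intensity ratings into continuous targets. The sketch below assumes a 0 to 4 rating scale and simple averaging; neither detail comes from the talk, and `regression_targets` is a hypothetical helper.

```python
from statistics import mean


def regression_targets(ratings, scale_max=4):
    """Average per-annotator (positive, negative) ratings into [0, 1] targets.

    `ratings` is a list of (positive_intensity, negative_intensity) pairs,
    one pair per annotator, each value on an assumed 0..scale_max scale.
    """
    positive = mean(p for p, _ in ratings) / scale_max
    negative = mean(n for _, n in ratings) / scale_max
    return {"positive": positive, "negative": negative}
```

For example, `regression_targets([(0, 3), (0, 4), (1, 2)])` yields a negative target of 0.75 and a small positive target, which keeps annotator disagreement in the label instead of forcing a single class.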

3.3 Oral‑Expression Handling

Oral expressions often contain redundant information. Two preprocessing strategies are used: (1) short‑sentence classification + pattern reasoning, and (2) BERT‑based summarization to extract concise intents.
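The first strategy, short‑sentence classification followed by pattern reasoning, can be sketched roughly as follows. The clause labels, keyword cues, and patterns are hypothetical, and a keyword lookup stands in for whatever trained classifier the production system uses.

```python
import re

# Hypothetical clause labels and keyword cues; a trained classifier would
# replace `classify_clause` in practice.
CUES = {
    "ORDER": ("order", "bought", "purchased"),
    "PROBLEM": ("broken", "damaged", "not working"),
    "REQUEST_REFUND": ("refund", "money back"),
}

# Hypothetical patterns: if the labels on the left occur in this order
# (not necessarily adjacent), emit the intent on the right.
PATTERNS = [
    (("PROBLEM", "REQUEST_REFUND"), "refund_for_defect"),
    (("REQUEST_REFUND",), "refund_general"),
]


def split_clauses(utterance):
    """Split a long utterance into short clauses at punctuation marks."""
    parts = re.split(r"[,.;?!]+", utterance.lower())
    return [p.strip() for p in parts if p.strip()]


def classify_clause(clause):
    """Label one clause by keyword cue; anything unmatched is filler."""
    for label, cues in CUES.items():
        if any(cue in clause for cue in cues):
            return label
    return "FILLER"


def is_subsequence(pattern, labels):
    """True if `pattern` appears in `labels` in order, gaps allowed."""
    it = iter(labels)
    return all(label in it for label in pattern)


def infer_intent(utterance):
    """Return (intent, clause_labels) for one spoken utterance."""
    labels = [classify_clause(c) for c in split_clauses(utterance)]
    for pattern, intent in PATTERNS:
        if is_subsequence(pattern, labels):
            return intent, labels
    return "unknown", labels
```

Because filler clauses are labeled and then skipped by the pattern match, a rambling utterance such as "Um, hello, so I bought a kettle last week, and it arrived damaged, you know, so I want my money back" reduces to the same intent as a terse one.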

4. Duplex (Full‑Duplex) Dialogue

Duplex dialogue is defined by three properties: it is exclusive (a single‑channel call), continuous (turn‑taking is not atomic), and of imperfect information (neither side knows for certain when the other has finished speaking). The goal is to minimize both the time in which neither party speaks and the time in which both speak at once.
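The two quantities in that goal can be measured directly from speech timestamps. The sketch below assumes each party's speech is available as a list of `(start, end)` intervals in seconds; the function names and the input format are illustrative choices, not part of the talk.

```python
def total_overlap(a, b):
    """Total time during which an interval from `a` and one from `b` overlap."""
    return sum(
        max(0.0, min(a_end, b_end) - max(a_start, b_start))
        for a_start, a_end in a
        for b_start, b_end in b
    )


def duplex_costs(user, agent, call_end):
    """Return (both_speaking, both_silent) in seconds for one call.

    `user` and `agent` are lists of (start, end) speech intervals, each list
    non-overlapping within itself; the call spans [0, call_end].
    """
    both_speaking = total_overlap(user, agent)
    user_time = sum(end - start for start, end in user)
    agent_time = sum(end - start for start, end in agent)
    # Inclusion-exclusion: time in which at least one party is speaking.
    someone_speaking = user_time + agent_time - both_speaking
    both_silent = call_end - someone_speaking
    return both_speaking, both_silent
```

With `user = [(0.0, 2.0), (5.0, 6.0)]`, `agent = [(2.5, 5.5)]`, and `call_end = 6.0`, the agent waits 0.5 s before replying and the user barges in 0.5 s before the agent finishes, so both values come out to 0.5 s.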

4.1 Representation

A domain‑specific language (DSL) encodes dialogue as state → event → action. This structured representation enables training data generation and visualization of complex turn‑taking patterns.
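A state → event → action representation can be pictured as a transition table. The talk does not list the DSL's actual vocabulary, so every state, event, and action name below is a hypothetical placeholder, and `run` is an illustrative helper.

```python
# Hypothetical vocabulary; the talk does not list the DSL's actual states,
# events, or actions.
TRANSITIONS = {
    # (state, event): (next_state, action)
    ("agent_speaking", "user_starts_speaking"): ("both_speaking", "keep_talking_and_listen"),
    ("both_speaking", "user_intent_relevant"): ("user_speaking", "stop_speaking"),
    ("both_speaking", "user_backchannel"): ("agent_speaking", "continue_speaking"),
    ("user_speaking", "user_stops_speaking"): ("agent_speaking", "start_reply"),
}


def run(initial_state, events):
    """Replay events through the table; return the final state and a trace.

    Each trace entry is a (state, event, action) triple, which is the shape
    that could be logged as training data or drawn as a timeline.
    """
    state, trace = initial_state, []
    for event in events:
        key = (state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"no transition for {key}")
        next_state, action = TRANSITIONS[key]
        trace.append((state, event, action))
        state = next_state
    return state, trace
```

Replaying `["user_starts_speaking", "user_backchannel"]` from `agent_speaking` returns to `agent_speaking` with a two‑step trace, which is the kind of flat record a structured representation makes easy to generate and inspect.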

4.2 Capabilities

Shorter response latency: By "listening‑while‑thinking" and "thinking‑while‑speaking", response time drops to ~500‑700 ms, close to human turn‑taking.

Semantic interruption: The system decides whether a user’s interruption is relevant to the current task, avoiding naïve audio‑based cuts.

Interactive digit collection: Specialized handling of high‑precision numeric inputs (phone numbers, IDs) improves collection completeness compared with traditional keypad methods.

Simulation environment: An instruction‑level simulator reproduces duplex timing and interaction patterns, facilitating offline training of end‑to‑end models.
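As a rough picture of the interactive digit collection mentioned above, the sketch below accumulates a number chunk by chunk, echoes each chunk back, and lets the user replace the last chunk. It assumes the ASR transcript already contains numerals, and the correction rule (a turn starting with "no") and the prompts are invented for this example.

```python
import re


class DigitCollector:
    """Accumulate a fixed-length number from spoken chunks, one chunk per turn."""

    def __init__(self, expected_length):
        self.expected_length = expected_length
        self.chunks = []

    @property
    def digits(self):
        return "".join(self.chunks)

    def hear(self, transcript):
        """Consume one user turn and return the agent's next prompt."""
        found = "".join(re.findall(r"\d", transcript))
        # Hypothetical correction rule: a turn starting with "no" discards
        # the most recent chunk before any new digits are added.
        if transcript.strip().lower().startswith("no") and self.chunks:
            self.chunks.pop()
        if found:
            if len(self.digits) + len(found) > self.expected_length:
                return f"That is too many digits. I have {self.digits or 'nothing'} so far."
            self.chunks.append(found)
        if len(self.digits) == self.expected_length:
            return f"Got it: {self.digits}."
        if found:
            # Echo the chunk back so the user can catch a misrecognition early.
            return f"{found}, go on."
        return "Please continue with the number."
```

Echoing each chunk immediately, instead of waiting for the full number, is what makes the exchange interactive: a misheard group can be corrected in the next turn without restarting the whole collection.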

5. Summary

The talk highlighted three distinctive aspects of real‑time voice dialogue—colloquialism, multimodality, and duplexing—and presented concrete solutions: fault‑tolerant SLU, multimodal emotion detection, oral‑expression processing, and a full‑duplex conversational architecture that reduces latency, supports semantic interruptions, and enables robust digit collection.

For a live demo, callers can dial Alibaba’s official hotline.
