Design and Implementation of a Dialogue Management System for Intelligent Voice Robots
This article presents a comprehensive overview of an intelligent voice robot's dialogue management system, detailing its architecture, natural language understanding components, dialogue manager design, strategy handling, and workflow processes to achieve fluent multi‑turn interactions in telephone scenarios.
The intelligent voice robot, developed by 58 Group's TEG AI Lab, supports automatic dialing, multi‑turn voice interaction, and intent recognition, but faces challenges in maintaining professional and fluid conversations across diverse scripted dialogues.
The system's core modules are a voice call module, an intent recognition module, and a dialogue management system; this article focuses on the dialogue management system.
In task-oriented dialogue, such as ticket booking or weather queries, a dialogue management system combines intent and slot recognition results with the system configuration, which is often implemented as a finite state machine, to generate an appropriate response.
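To make the finite state machine idea concrete, here is a minimal sketch of a ticket-booking dialogue driven by recognized intents and slots. The state names, intent labels, and prompts are hypothetical and chosen only for illustration; they are not taken from the system described in this article.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical configuration: (current state, user intent) -> next state.
TRANSITIONS = {
    ("ask_destination", "inform_destination"): "ask_date",
    ("ask_date", "inform_date"): "confirm",
    ("confirm", "affirm"): "done",
    ("confirm", "deny"): "ask_destination",
}

# Hypothetical prompt for each state; placeholders are filled from slots.
PROMPTS = {
    "ask_destination": "Where would you like to travel?",
    "ask_date": "What date would you like to leave?",
    "confirm": "Shall I book a ticket to {destination} on {date}?",
    "done": "Your ticket is booked.",
}


@dataclass
class DialogueFSM:
    state: str = "ask_destination"
    slots: dict = field(default_factory=dict)

    def step(self, intent: str, slots: Optional[dict] = None) -> str:
        """Consume one intent/slot result and return the next system prompt."""
        self.slots.update(slots or {})
        # Stay in the current state when the intent has no transition from it.
        self.state = TRANSITIONS.get((self.state, intent), self.state)
        return PROMPTS[self.state].format(**self.slots)


fsm = DialogueFSM()
print(fsm.step("inform_destination", {"destination": "Beijing"}))
print(fsm.step("inform_date", {"date": "Friday"}))
print(fsm.step("affirm"))
```

Keeping the transitions in a plain table rather than in code is one way such a configuration could be edited per business scenario without touching the engine.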
Unlike passive chatbots, the voice robot initiates calls with predefined scripts for sales, notifications, or follow‑ups, requiring high fluency to handle user utterances and unexpected questions.
Scripts are organized into main, branch, generic, and standard question flows, guiding the robot's behavior during calls.
The dialogue management system consists of Natural Language Understanding (NLU) and a Dialogue Manager (DM). NLU processes streamed speech (after VAD and ASR) to extract the user's intent and slot information, classify the sound type, and match the utterance against standard questions. It uses models such as VGGish+BiLSTM for sound type classification and BiLSTM-DSSM for semantic matching.
NLU components include:
Single‑sentence intent recognition covering 19 fixed labels (e.g., affirmation, request, hang‑up).
Sound type recognition to filter out unclear, blank, machine, or noisy audio, improving intent accuracy.
Standard question matching using a BiLSTM‑DSSM model to handle out‑of‑script user queries.
Slot filling implemented with a BiLSTM‑CRF model for extracting entities like time or location.
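The four NLU outputs listed above can be pictured as one result record per utterance, with sound type acting as a gate on the rest. The sketch below is an assumption about how such a record might look: the field names, the `NON_SPEECH` label strings, and the `"speech"` label are hypothetical, and the models themselves are not shown.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical label strings for the unclear, blank, machine, and noisy
# audio categories named in the article.
NON_SPEECH = {"unclear", "blank", "machine", "noise"}


@dataclass
class NLUResult:
    text: str                                # ASR transcript
    sound_type: str                          # e.g. "speech" or one of NON_SPEECH
    intent: Optional[str] = None             # one of the fixed intent labels
    slots: dict = field(default_factory=dict)
    standard_question: Optional[str] = None  # matched standard question, if any


def gate_by_sound_type(result: NLUResult) -> NLUResult:
    """Drop intent, slots, and question match when the audio is not usable speech."""
    if result.sound_type in NON_SPEECH:
        return NLUResult(text=result.text, sound_type=result.sound_type)
    return result


noisy = NLUResult(text="uh", sound_type="noise", intent="affirmation")
print(gate_by_sound_type(noisy).intent)
```

Filtering before the intent is consumed means a spurious label on background noise never reaches the Dialogue Manager.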
The Dialogue Manager maintains a dialog context (history) and a Dialog Policy Manager (DPM) that orders multiple dialog policies by priority. Policies process user actions to generate system actions, selecting appropriate script responses based on action type, user intent, or semantic similarity.
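A minimal sketch of priority-ordered policy evaluation follows. The `Policy` fields, the convention that a lower number means higher priority, and the sample responses are all assumptions made for illustration; the article does not specify them.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Policy:
    name: str
    priority: int                    # assumed convention: lower value is tried first
    matches: Callable[[dict], bool]  # does this policy apply to the user action?
    answer: Callable[[dict], str]    # produce the system response


class DialogPolicyManager:
    def __init__(self, policies: List[Policy]):
        # Order the policies once so evaluation always follows priority.
        self.policies = sorted(policies, key=lambda p: p.priority)

    def respond(self, user_action: dict) -> Optional[str]:
        """Return the answer of the first policy that matches, or None."""
        for policy in self.policies:
            if policy.matches(user_action):
                return policy.answer(user_action)
        return None


general = Policy(
    name="general_intent",
    priority=0,
    matches=lambda a: a.get("intent") in {"hang_up", "greeting"},
    answer=lambda a: "Goodbye." if a["intent"] == "hang_up" else "Hello!",
)
main_line = Policy(
    name="main_line",
    priority=2,
    matches=lambda a: True,          # fallback: continue the main script
    answer=lambda a: "Let me continue with the offer.",
)

dpm = DialogPolicyManager([main_line, general])
print(dpm.respond({"intent": "hang_up"}))
print(dpm.respond({"intent": "request"}))
```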
Typical policies include:
General intent policy for common intents such as hang‑up or greeting.
Standard question policy that retrieves answers via semantic similarity thresholds.
Main‑line policy that follows primary script branches, using intent matching and similarity scoring to choose the correct response.
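The threshold step in the standard question policy can be sketched as follows. This example substitutes plain cosine similarity over hand-written toy vectors for the BiLSTM-DSSM model, and the questions, answers, vectors, and the 0.8 threshold are hypothetical values, not figures from the article.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy "embeddings" standing in for the output of a semantic matching model.
STANDARD_QUESTIONS = {
    "How much does it cost?": ([1.0, 0.0, 0.0], "The service is free to try."),
    "Who is calling?": ([0.0, 1.0, 0.0], "This is the customer service line."),
}


def answer_standard_question(query_vec, threshold=0.8):
    """Return the answer of the best-matching question, or None below the threshold."""
    best_question, best_score = None, 0.0
    for question, (vec, _answer) in STANDARD_QUESTIONS.items():
        score = cosine(query_vec, vec)
        if score > best_score:
            best_question, best_score = question, score
    if best_question is None or best_score < threshold:
        return None  # let a lower-priority policy handle the utterance
    return STANDARD_QUESTIONS[best_question][1]


print(answer_standard_question([0.9, 0.1, 0.0]))
print(answer_standard_question([0.5, 0.5, 0.5]))
```

Returning `None` below the threshold, rather than the closest answer regardless of score, keeps the robot from answering a question the user did not ask.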
The workflow involves saving each interaction (user, system, or trigger actions) into the dialog context, evaluating policies sequentially, and executing the matched policy's answer function to produce a system action, which is then sent to the user via SIP.
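The bookkeeping side of this workflow, saving every action and producing one system action per user turn, might look like the sketch below. The `Action` and `DialogContext` shapes and the `respond` callback are hypothetical, and sending the result over SIP is left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Action:
    source: str   # "user", "system", or "trigger"
    payload: str  # recognized text, script response, or event name


@dataclass
class DialogContext:
    history: List[Action] = field(default_factory=list)

    def save(self, action: Action) -> None:
        self.history.append(action)

    def last(self, source: str) -> Optional[Action]:
        """Most recent action from the given source, or None."""
        for action in reversed(self.history):
            if action.source == source:
                return action
        return None


def handle_turn(context: DialogContext, user_text: str,
                respond: Callable[[DialogContext], str]) -> Action:
    """Record the user action, ask the policies for a reply, and record it too."""
    context.save(Action("user", user_text))
    system_action = Action("system", respond(context))
    context.save(system_action)
    return system_action  # the real system would now send this to the caller


ctx = DialogContext()
ctx.save(Action("trigger", "call_connected"))
reply = handle_turn(ctx, "hello", lambda c: "echo: " + c.last("user").payload)
print(reply.payload)
```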
Examples of processing flows for general intents, standard questions, and main‑line scenarios illustrate how the system ensures smooth conversations despite low‑quality telephone audio.
In summary, the dialogue management system provides a generic engine for complex business scenarios and supports multiple scenarios at low cost. Challenges in speech quality and recognition remain and will be addressed in future work.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.