Multi‑Turn Voice Bot Architecture and End‑to‑End Dialogue Jump Strategies at 58.com
This article describes the overall architecture of 58.com’s multi‑turn voice robot, explains rule‑based, intent‑based and text‑matching dialogue jump strategies, introduces an end‑to‑end classification approach using TextCNN, and reports its online performance improvements and future research directions.
The voice robot combines speech recognition, semantic understanding, and speech synthesis to enable multi‑turn conversational interactions, reducing manual workload in scenarios such as product marketing, service notifications, and user surveys.
58.com’s platform connects millions of C‑end users with B‑end merchants; phone calls are a key communication channel, prompting the development of a voice robot to automate repetitive outreach and improve sales efficiency.
Overall Architecture – The system is divided into five layers: Application (outbound calls, intelligent call assistant, video interview), Access (outbound, inbound, transfer, ring‑back), Logic (core dialogue control), Editing/Operations (data labeling for model iteration), and Infrastructure (SIP resources, network audio, ASR engine, TTS service).
Interaction Flow – User audio is segmented (VAD), transcribed (ASR), then processed for voice type, intent, and slot extraction; a dialogue manager selects a response strategy and generates synthesized speech.
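The flow above can be sketched as a small per-turn pipeline. This is only an illustrative skeleton: the `TurnResult` type, the `process_turn` function, and every stage callable are hypothetical names, and the real VAD, ASR, understanding, and TTS components are injected as plain functions rather than modeled.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class TurnResult:
    text: str                      # ASR transcript of one user segment
    voice_type: str                # label set is an assumption, e.g. "human"
    intent: str
    slots: Dict[str, str] = field(default_factory=dict)
    reply_audio: bytes = b""       # synthesized reply for this segment


def process_turn(
    audio: bytes,
    vad: Callable[[bytes], List[bytes]],
    asr: Callable[[bytes], str],
    classify_voice: Callable[[bytes], str],
    understand: Callable[[str], Tuple[str, Dict[str, str]]],
    choose_reply: Callable[[str, str, Dict[str, str]], str],
    tts: Callable[[str], bytes],
) -> List[TurnResult]:
    """Run each VAD segment through ASR, understanding, strategy choice, and TTS."""
    results = []
    for segment in vad(audio):
        text = asr(segment)
        voice_type = classify_voice(segment)   # works on audio, not on text
        intent, slots = understand(text)       # intent plus extracted slots
        reply_text = choose_reply(voice_type, intent, slots)
        results.append(TurnResult(text, voice_type, intent, slots, tts(reply_text)))
    return results
```

Passing the stages in as callables keeps the sketch independent of any particular speech engine and makes each stage easy to replace with a stub in tests.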
Dialogue Components – Script: a directed graph of dialogue nodes (mainline, generic, keyword Q&A, standard questions). Semantic Understanding: single‑sentence intent classification (19 categories), slot extraction via entity recognition, and voice‑type classification using VGGish. Dialogue Manager: a prioritized chain of strategies (rule, intent, similarity) that determines the next node.
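A script of this kind can be represented as a plain adjacency structure. The sketch below is a minimal illustration, not the production data model: the `Node` and `Script` classes, their field names, and the example node ids are all assumptions.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Node:
    node_id: str
    kind: str      # "mainline", "generic", "keyword_qa", or "standard_question"
    prompt: str    # what the robot says when it reaches this node
    next_ids: List[str] = field(default_factory=list)  # outgoing edges


class Script:
    """Directed graph of dialogue nodes, keyed by node id."""

    def __init__(self, nodes: List[Node]):
        self.nodes: Dict[str, Node] = {n.node_id: n for n in nodes}
        # Reject edges that point at nodes missing from the script.
        for node in nodes:
            for target in node.next_ids:
                if target not in self.nodes:
                    raise ValueError(f"{node.node_id} -> unknown node {target}")

    def successors(self, node_id: str) -> List[str]:
        return list(self.nodes[node_id].next_ids)

    def reachable_from(self, node_id: str) -> Set[str]:
        """Breadth-first search over outgoing edges."""
        seen: Set[str] = set()
        queue = deque(self.successors(node_id))
        while queue:
            current = queue.popleft()
            if current in seen:
                continue
            seen.add(current)
            queue.extend(self.successors(current))
        return seen
```

A dialogue manager can then treat "choose the next node" as picking one id from the graph, which is also what makes the jump decision expressible as a classification label.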
Evaluation Metric – Response Accuracy measures the correctness of the robot’s action for each user reply, providing a holistic assessment of the entire model‑plus‑strategy pipeline.
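Read literally, the metric is the share of user replies for which the robot took the correct action. A minimal sketch follows, assuming each turn has been labeled with an expected action; the function name and the string action labels are hypothetical.

```python
from typing import Sequence


def response_accuracy(predicted: Sequence[str], expected: Sequence[str]) -> float:
    """Fraction of turns where the robot's action matches the labeled action."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0  # assumption: report 0.0 for an empty evaluation set
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)
```

Because the comparison is made on the final action, an ASR error, a wrong intent, or a wrong strategy choice all count against the same number.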
Rule‑Based, Intent‑Based, and Text‑Matching Strategies – Rules use regex for node jumps; intent‑based jumps rely on classified user intent; text‑matching compares user utterances with predefined question‑answer pairs. Advantages include configurability and cold‑start capability; drawbacks are incomplete coverage, coarse granularity, difficulty handling long‑sentence similarity, and lack of learning.
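The three strategies can be chained in priority order, with the first one that returns a node winning. The sketch below is an assumption-heavy illustration: the regex, the intent names, the node ids, the 0.8 threshold, and the `clarify` fallback are all made up, and `difflib.SequenceMatcher` stands in for whatever text-matching model is actually used.

```python
import re
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical configuration; a real script would supply these tables.
RULES = [(re.compile(r"not interested|don't call"), "end_call")]
INTENT_TO_NODE = {"affirm": "pitch_details", "deny": "end_call"}
QA_PAIRS = {"how much does it cost": "qa_price", "who are you": "qa_identity"}


def rule_strategy(text: str, intent: str) -> Optional[str]:
    for pattern, node in RULES:
        if pattern.search(text.lower()):
            return node
    return None


def intent_strategy(text: str, intent: str) -> Optional[str]:
    return INTENT_TO_NODE.get(intent)


def similarity_strategy(text: str, intent: str, threshold: float = 0.8) -> Optional[str]:
    best_node, best_score = None, 0.0
    for question, node in QA_PAIRS.items():
        # Character-level ratio; a stand-in for a learned similarity model.
        score = SequenceMatcher(None, text.lower(), question).ratio()
        if score > best_score:
            best_node, best_score = node, score
    return best_node if best_score >= threshold else None


def next_node(text: str, intent: str, fallback: str = "clarify") -> str:
    """Try rule, then intent, then similarity; fall back if none fires."""
    for strategy in (rule_strategy, intent_strategy, similarity_strategy):
        node = strategy(text, intent)
        if node is not None:
            return node
    return fallback
```

The sketch also makes the listed drawbacks visible: every table must be maintained by hand, and nothing in the chain improves as more dialogue logs accumulate.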
End‑to‑End Jump Strategy – This strategy treats each turn as a classification task: the robot’s prompt and the user’s reply are concatenated and fed into a TextCNN model that directly predicts the next node. Experiments show TextCNN outperforms LSTM and Transformer models on noisy ASR transcripts.
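To make the classification framing concrete, here is a dependency-free sketch of a TextCNN forward pass: embedding lookup, convolutions of several widths, ReLU, max-over-time pooling, and a linear layer over node ids. Everything in it is an assumption for illustration: the `[SEP]` token, whitespace tokenization, the layer sizes, and the random untrained weights (bias terms are omitted). A real model would be trained in a deep learning framework.

```python
import random
from typing import Dict, List

SEP = "[SEP]"  # assumed separator between robot text and user text


def build_input(robot_text: str, user_text: str) -> List[str]:
    """Concatenate the robot prompt and the user reply into one token list."""
    return robot_text.lower().split() + [SEP] + user_text.lower().split()


class TinyTextCNN:
    """Forward pass only; weights are random, so predictions are meaningless."""

    def __init__(self, vocab: Dict[str, int], num_nodes: int,
                 emb_dim: int = 8, widths=(2, 3), filters: int = 4, seed: int = 0):
        rng = random.Random(seed)

        def rand(n: int) -> List[float]:
            return [rng.uniform(-0.5, 0.5) for _ in range(n)]

        self.vocab = vocab
        self.widths = widths
        # One extra embedding row at the end serves unknown tokens.
        self.emb = [rand(emb_dim) for _ in range(len(vocab) + 1)]
        # Each kernel is flattened to width * emb_dim weights.
        self.conv = {w: [rand(w * emb_dim) for _ in range(filters)] for w in widths}
        self.out = [rand(len(widths) * filters) for _ in range(num_nodes)]

    def logits(self, tokens: List[str]) -> List[float]:
        unk = len(self.vocab)
        x = [self.emb[self.vocab.get(t, unk)] for t in tokens]
        pad = [0.0] * len(self.emb[0])
        while len(x) < max(self.widths):   # pad so the widest kernel fits once
            x.append(pad)
        features = []
        for w in self.widths:
            for kernel in self.conv[w]:
                best = 0.0                 # starting at 0 applies ReLU before pooling
                for i in range(len(x) - w + 1):
                    window = [v for row in x[i:i + w] for v in row]
                    best = max(best, sum(a * b for a, b in zip(window, kernel)))
                features.append(best)      # max-over-time pooled activation
        return [sum(f * wt for f, wt in zip(features, row)) for row in self.out]

    def predict(self, tokens: List[str]) -> int:
        """Return the index of the highest-scoring next node."""
        scores = self.logits(tokens)
        return max(range(len(scores)), key=scores.__getitem__)
```

Max-over-time pooling keeps only the strongest local n-gram match per filter, which is one plausible reason such a model tolerates ASR noise elsewhere in the utterance.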
Model Training & Data Augmentation – Augmentation techniques include random replacement of keyword‑Q&A prompts, rule‑generated industry‑specific utterances, manually collected data, and automated clustering of dialogue logs to generate synthetic samples.
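The first technique can be read in more than one way. The sketch below shows one possible reading, stated here as an assumption: a keyword‑Q&A reply should lead to the same answer node regardless of what the robot said just before, so the robot side of such a sample can be swapped for other prompts to create new training samples with the same label. The `Sample` layout and the function name are hypothetical.

```python
import random
from typing import List, Set, Tuple

Sample = Tuple[str, str, str]  # (robot_text, user_text, target_node)


def augment_by_prompt_swap(samples: List[Sample], robot_prompts: List[str],
                           qa_nodes: Set[str], copies: int = 2,
                           seed: int = 0) -> List[Sample]:
    """Create extra samples for keyword-Q&A targets by swapping the robot prompt."""
    rng = random.Random(seed)
    augmented: List[Sample] = []
    for robot_text, user_text, node in samples:
        if node not in qa_nodes:
            continue  # only Q&A targets are assumed to be prompt-independent
        others = [p for p in robot_prompts if p != robot_text]
        for new_prompt in rng.sample(others, min(copies, len(others))):
            augmented.append((new_prompt, user_text, node))
    return augmented
```

The user text and the label are left untouched, so the augmentation only teaches the model that the preceding prompt is irrelevant for these targets.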
Transferability & Online Results – A single model can serve multiple script versions. In online A/B tests, the end‑to‑end approach improved overall response accuracy by 3.38%, long‑sentence accuracy by 14.98%, and the compliant outbound conversion rate by 12.95%.
Future Directions – Integrate a universal intent module, handle multi‑intent detection, incorporate historical state for script jumps, explore Next Sentence Prediction, and apply reinforcement learning for automated strategy optimization.
Application – Sales Intelligent Outbound Assistant – In the “Michigan” sales workflow, the voice robot automates the first contact call, filters qualified leads, and feeds them to human sales teams, achieving higher compliance and efficiency.
Conclusion – This article presents the transition from traditional rule‑based dialogue management to an end‑to‑end, data‑driven approach, demonstrating significant performance gains in real‑world multi‑turn voice interactions.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.