Advances in Voice Interaction: 360's Intelligent Dialogue System Architecture and Core Technologies
This article presents a comprehensive overview of 360's voice interaction platform, detailing dialogue system fundamentals, platform architecture, and core technologies such as semantic understanding, dialog management, and question answering, all driven by deep learning and multimodal innovations.
With the rapid development of voice interaction technology, dialogue systems have become increasingly mature, largely driven by deep learning techniques that leverage large-scale data for feature representation and response generation, enhancing user experience.
This article shares the practical implementation of voice interaction technology at 360, covering its deployment in products such as the 360 smart speaker, children’s smartwatch, and security software.
1. Fundamentals of Dialogue Systems
A typical dialogue system pipeline consists of Automatic Speech Recognition (ASR) to convert speech to text, Natural Language Understanding (NLU) to interpret intent and slots, Dialog Manager (DM) for state tracking and policy decision, and Natural Language Generation (NLG) plus Text‑to‑Speech (TTS) for output.
Dialogue systems fall into three broad categories: task‑oriented, question‑answering (QA), and chit‑chat systems.
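The pipeline above can be sketched as a chain of stages. The function bodies here are toy stand‑ins (a real ASR model, NLU classifier, and TTS engine would replace them), and the weather intent, slot names, and replies are invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class DialogTurn:
    text: str = ""
    intent: str = ""
    slots: dict = field(default_factory=dict)
    reply: str = ""

def asr(audio: bytes) -> str:
    # Stand-in for a real speech recognizer: pretend the audio decodes to text.
    return audio.decode("utf-8")

def nlu(text: str) -> DialogTurn:
    # Toy intent and slot extraction for a single weather intent.
    turn = DialogTurn(text=text)
    if "weather" in text:
        turn.intent = "query_weather"
        for city in ("Beijing", "Shanghai"):
            if city in text:
                turn.slots["city"] = city
    return turn

def dm(turn: DialogTurn) -> DialogTurn:
    # Policy: answer if the required slot is filled, otherwise ask for it.
    if turn.intent == "query_weather":
        if "city" in turn.slots:
            turn.reply = f"Sunny in {turn.slots['city']} today."
        else:
            turn.reply = "Which city do you mean?"
    else:
        turn.reply = "Sorry, I didn't catch that."
    return turn

def nlg_tts(turn: DialogTurn) -> str:
    # NLG is trivial here; a real system would hand the text to a TTS engine.
    return turn.reply

def pipeline(audio: bytes) -> str:
    return nlg_tts(dm(nlu(asr(audio))))
```

For example, `pipeline(b"weather in Beijing")` yields a direct answer, while a query missing the city slot triggers a clarifying question.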
2. 360 Intelligent Voice Interaction Platform
The platform adopts a modular architecture that decouples business logic from the core engine, enabling rapid skill development (≈1 week for simple skills, ≈2 weeks for complex ones) and supporting 82 built‑in skills across multiple products.
A key innovation is the multimodal access layer, which introduces “events” to handle non‑textual inputs such as camera‑based mask detection.
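One way to model this access layer is a router that dispatches both text utterances and non‑textual events to registered skill handlers. The class, skill predicates, and event type below are hypothetical sketches, not the platform's actual API:

```python
class SkillRouter:
    """Routes text queries and non-textual 'events' to registered skills."""

    def __init__(self):
        self._text_skills = []   # list of (predicate, handler) pairs
        self._event_skills = {}  # event_type -> handler

    def register_text_skill(self, predicate, handler):
        self._text_skills.append((predicate, handler))

    def register_event_skill(self, event_type, handler):
        self._event_skills[event_type] = handler

    def handle_text(self, utterance: str) -> str:
        for predicate, handler in self._text_skills:
            if predicate(utterance):
                return handler(utterance)
        return "fallback: no skill matched"

    def handle_event(self, event_type: str, payload: dict) -> str:
        handler = self._event_skills.get(event_type)
        return handler(payload) if handler else "fallback: unknown event"

router = SkillRouter()
router.register_text_skill(
    lambda u: "music" in u,
    lambda u: "playing music",
)
# A camera-based detection arrives as an event rather than as text.
router.register_event_skill(
    "mask_detected",
    lambda p: "Thank you." if p["wearing_mask"] else "Please wear your mask.",
)
```

Decoupling skills behind a registration interface like this is what allows new skills to be added without touching the core engine.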
3. Core Technologies
3.1 Semantic Understanding
Task‑type NLU extracts domain, intent, and slots; challenges include error propagation, lack of external knowledge, and OOV handling. Solutions explored include rule‑based matching, generative n‑gram models, similarity‑based retrieval, and deep models such as SF‑ID (joint slot filling and intent detection) with attention, CRF, and slot‑gate mechanisms.
External knowledge vectors improve accuracy by roughly 15%.
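Of the approaches listed, rule‑based matching is the simplest to illustrate. This sketch extracts domain, intent, and slots with regex patterns; the rules, labels, and slot names are invented for illustration and are not the production rule set:

```python
import re

# Each rule maps a regex with named slot groups to a (domain, intent) pair.
RULES = [
    (re.compile(r"set an alarm (?:for|at) (?P<time>\d{1,2}(?::\d{2})?)"),
     ("alarm", "set_alarm")),
    (re.compile(r"play (?P<song>.+) by (?P<artist>.+)"),
     ("music", "play_song")),
]

def parse(utterance: str) -> dict:
    """Return domain, intent, and slots for the first matching rule."""
    for pattern, (domain, intent) in RULES:
        m = pattern.search(utterance.lower())
        if m:
            return {"domain": domain, "intent": intent, "slots": m.groupdict()}
    # No rule fired: fall through to the unknown domain (OOV handling would
    # hand this off to the retrieval- or model-based matchers instead).
    return {"domain": "unknown", "intent": "unknown", "slots": {}}
```

Rules give high precision on scripted phrasings but brittle recall, which is exactly why the platform backs them with similarity‑based retrieval and deep joint models such as SF‑ID.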
3.2 Dialogue Management
Dialogue management is implemented with Dialogue State Tracking (DST) and Dialogue Policy (DP). Two approaches are used: frame‑based (slot filling for task‑oriented dialogues) and FSM‑based (finite‑state machines for scripted tasks). The system also maintains contextual memory across turns, enabling cross‑scenario information inheritance.
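The FSM‑based approach can be sketched as a small state machine driven by a transition table. The states, user acts, and replies below model a hypothetical scripted ordering task, not an actual 360 skill:

```python
class FSMDialogManager:
    """Minimal finite-state dialogue manager for a scripted task.

    The transition table is hardcoded here; a real system would load
    it from a script definition so skills stay data-driven.
    """

    # state -> {user_act: (next_state, system_reply)}
    TRANSITIONS = {
        "start": {
            "order_food": ("ask_dish", "What would you like to order?"),
        },
        "ask_dish": {
            "give_dish": ("confirm", "Shall I place the order?"),
        },
        "confirm": {
            "yes": ("done", "Order placed."),
            "no": ("ask_dish", "Okay, what would you like instead?"),
        },
    }

    def __init__(self):
        self.state = "start"

    def step(self, user_act: str) -> str:
        # Unknown acts keep the current state and ask the user to retry.
        next_state, reply = self.TRANSITIONS.get(self.state, {}).get(
            user_act, (self.state, "Sorry, I didn't understand.")
        )
        self.state = next_state
        return reply
```

The frame‑based approach differs in that state is a set of slot values to fill rather than a single node, but the policy step (pick the next system act from the current state) has the same shape.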
3.3 Question Answering (QA)
The QA pipeline consists of query preprocessing, coarse retrieval (keyword‑based via Elasticsearch and embedding‑based via Faiss), fine‑ranking with an LSTM‑DSSM model, and business‑logic filtering. The LSTM‑DSSM outperforms BERT in this scenario while being computationally cheaper.
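The coarse‑retrieve‑then‑fine‑rank shape of this pipeline can be sketched in miniature. Here toy bag‑of‑words vectors stand in for the Elasticsearch/Faiss indices, a cosine score stands in for the LSTM‑DSSM ranker, and the FAQ entries and threshold are invented for illustration:

```python
import math
from collections import Counter

# Hypothetical FAQ knowledge base: question -> answer.
FAQ = {
    "How do I reset the speaker?": "Hold the power button for ten seconds.",
    "How do I pair the watch with my phone?": "Open the app and scan the QR code.",
}

def bow(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system uses learned vectors.
    return Counter(text.lower().replace("?", "").split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def answer(query: str, threshold: float = 0.3) -> str:
    q = bow(query)
    # Coarse retrieval: keep candidates sharing at least one term with the query.
    candidates = [k for k in FAQ if q & bow(k)]
    if not candidates:
        return "No answer found."
    # Fine ranking: score candidates (cosine stands in for the LSTM-DSSM).
    best = max(candidates, key=lambda k: cosine(q, bow(k)))
    # Business-logic filtering: reject low-confidence matches.
    return FAQ[best] if cosine(q, bow(best)) >= threshold else "No answer found."
```

The two‑stage design matters for cost: cheap retrieval prunes the candidate set so the expensive ranking model only scores a handful of pairs per query.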
Conclusion
The article introduced the basics and workflow of voice interaction systems, described the 360 intelligent voice platform architecture, and detailed core technologies including SF‑ID semantic understanding, dialog management strategies, and QA retrieval and ranking methods.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.