Detecting Invalid Queries in Voice Interaction: Non‑Human Interaction and Ambiguous Intent Recognition
This talk presents a comprehensive study of invalid query detection in voice assistants, covering the definition of effective and ineffective queries, challenges of non‑human interaction and ambiguous intent recognition, data collection, model design, experimental results, user‑feedback loops, and future research directions.
Speaker: Cui Shiqi, Head of Continuous Dialogue Algorithm at Xiaomi AI.
Introduction: The presentation focuses on recognizing invalid queries in voice interaction systems, which are divided into effective queries (fulfillable) and invalid queries (unfulfillable).
1. Invalid Query Overview
Invalid queries are categorized into two main types: non‑human interaction queries (background speech captured by the device) and ambiguous‑intent queries (the system cannot determine the user’s intention).
1.1 Effective Query Types
Single‑turn clear intent: the current utterance directly reveals the user’s goal (e.g., "turn on the living‑room AC").
Context‑clear intent: the surrounding context (scene, device state, user history) disambiguates the intent.
Multiple intents: the utterance matches several possible intents without sufficient context.
1.2 Invalid Query Types
Non‑human interaction: speech from bystanders or ambient noise mistakenly captured as a command.
Ambiguous intent: includes (a) scrambled/no‑meaning utterances caused by ASR errors, (b) incomplete expressions due to truncation, and (c) vague queries lacking clear intent words.
These invalid queries account for 5%–20% of all requests across Xiaomi devices, making their mitigation important.
2. Non‑Human Interaction Recognition
2.1 Challenges
Information‑incomplete ML task: requires acoustic, prosodic, and possibly visual cues, but most systems rely only on audio.
Variability of speech: tone, speed, background noise, and speaker differences hinder generalization.
Cocktail‑party effect: distinguishing the target speaker in noisy environments is difficult for machines.
2.2 Problem Modeling
Pure NLU approaches using only the query text perform poorly; audio signals are essential. Adding previous turn information yields little gain because non‑human queries are largely independent across turns.
2.3 Solution
The task is framed as a single‑turn binary classification problem. Key steps include dataset construction and feature/model selection.
2.3.1 Dataset Construction
High annotation cost: audio labeling requires skilled annotators; building a 100k‑sample set needs ~100 person‑days.
Quality improvement: detailed labeling guidelines, multi‑annotator verification, and iterative data cleaning raise label consistency.
Sample mining: random sampling for diversity, targeted mining for positive (non‑human) samples using ASR confidence or wake‑word detection, and hard‑sample mining via model scores and user feedback.
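The hard-sample mining step above can be sketched as follows. This is a minimal illustration, not the talk's actual pipeline: the idea is that queries whose model score falls near the decision boundary are the most informative to route to annotators. The score band and the example queries are assumptions.

```python
def mine_hard_samples(scored_queries, low=0.4, high=0.6):
    """Return queries whose non-human score lies in the ambiguous band [low, high]."""
    return [(q, s) for q, s in scored_queries if low <= s <= high]

# Illustrative (query, model score) pairs; scores are made up.
scored = [
    ("turn on the living-room AC", 0.05),  # confidently human-directed
    ("uh the weather um", 0.55),           # ambiguous: worth annotating
    ("<background TV audio>", 0.97),       # confidently non-human
]
hard = mine_hard_samples(scored)
```

In practice the band would be tuned against annotation budget: a wider band yields more labels but lower information density per label.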
2.3.2 Model Architecture
A deep neural network combines four feature groups:
Speech features: raw audio processed into spectrograms (frame, window, FFT) fed to a CNN‑LSTM‑Attention encoder.
Text features: word embeddings of the ASR transcript encoded by a TextCNN.
High‑level acoustic features: ASR confidence scores.
High‑level semantic features: NLU‑extracted domain intents and slots.
The speech encoder uses multi‑layer CNNs, followed by LSTM and Attention; the text encoder uses TextCNN. Outputs are concatenated with high‑level features and passed to separate classification heads for first‑turn and non‑first‑turn queries.
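The spectrogram front end mentioned above (frame, window, FFT) can be sketched with NumPy. The frame length, hop size, and FFT size below are common defaults for 16 kHz audio and are assumptions, not the talk's stated configuration; the encoder itself (CNN-LSTM-Attention) is omitted.

```python
import numpy as np

def spectrogram(signal, frame_len=400, hop=160, n_fft=512):
    """Slice a waveform into overlapping windowed frames and take |FFT| per frame."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)  # taper each frame to reduce spectral leakage
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft pads each 400-sample frame to n_fft and keeps the positive-frequency bins
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))  # shape (n_frames, n_fft//2 + 1)

# One second of 16 kHz audio; random noise stands in for a real recording.
audio = np.random.randn(16000)
spec = spectrogram(audio)  # shape (98, 257)
```

The resulting time-frequency matrix is what a CNN encoder would consume as a 2-D input, which matches the finding that spectrograms outperformed MFCC/fbank here.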
2.4 Experimental Findings
Spectrogram features outperform MFCC/fbank for this task.
Deeper CNN stacks (more convolutional layers) improve robustness; adding LSTM+Attention on top yields further gains.
TextCNN performs on par with Transformer/BERT for this classification.
Simple concatenation of speech and text embeddings outperforms attention‑based fusion.
3. Ambiguous‑Intent Recognition
3.1 Types of Ambiguity
Scrambled/no‑meaning queries: high perplexity sentences detected via language‑model perplexity.
Incomplete expressions: sentences cut off or missing final tokens.
Vague intent: queries lacking clear intent words.
3.2 Scrambled Query Detection
Perplexity is computed using large‑scale language models (LSTM, BERT, or GPT‑style autoregressive models). BERT‑based perplexity is accurate but computationally heavy; GPT‑style models reduce inference cost while maintaining performance.
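The perplexity idea can be shown with a toy autoregressive bigram model in place of the large LSTM/GPT-style LMs the talk describes; the tiny corpus and add-one smoothing are purely illustrative assumptions. A scrambled query assigns low probability to its token sequence and therefore scores high perplexity.

```python
import math

# Build bigram counts from a tiny stand-in corpus.
corpus = ["turn on the light", "turn off the light", "play some music"]
vocab = {w for s in corpus for w in s.split()} | {"<s>"}
counts, context = {}, {}
for s in corpus:
    toks = ["<s>"] + s.split()
    for a, b in zip(toks, toks[1:]):
        counts[(a, b)] = counts.get((a, b), 0) + 1
        context[a] = context.get(a, 0) + 1

def perplexity(sentence):
    """Per-token perplexity under the add-one-smoothed bigram model."""
    toks = ["<s>"] + sentence.split()
    logp = 0.0
    for a, b in zip(toks, toks[1:]):
        p = (counts.get((a, b), 0) + 1) / (context.get(a, 0) + len(vocab))
        logp += math.log(p)
    return math.exp(-logp / (len(toks) - 1))
```

A fluent query like "turn on the light" scores markedly lower perplexity than its scrambled permutation "light the on turn", which is the signal thresholded to flag no-meaning queries.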
3.3 Incomplete Expression Detection
Two approaches are explored:
Predict the next token as an end‑of‑sentence marker using a language model.
Binary classification using query text and auxiliary features (e.g., bigram, mutual information) to handle colloquial and truncated utterances.
Both single‑turn and multi‑turn models are employed: the single‑turn model judges completeness directly, while the multi‑turn model re‑evaluates using the previous turn context.
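The first approach above, predicting whether the next token is an end-of-sentence marker, reduces to thresholding P(EOS | prefix). The probability table below is a made-up illustration standing in for real language-model output, and the threshold is an assumption.

```python
# Hypothetical P(</s> | last token) values a trained LM might produce.
eos_prob = {
    "light": 0.72,  # "turn on the light" plausibly ends here
    "the": 0.01,    # "turn on the" is almost certainly truncated
}

def is_incomplete(query, threshold=0.2):
    """Flag a query as truncated when the LM's end-of-sentence probability is low."""
    last = query.split()[-1]
    return eos_prob.get(last, 0.5) < threshold

assert is_incomplete("turn on the")
assert not is_incomplete("turn on the light")
```

A real system would condition on the full prefix (and, per the multi-turn model, on the previous turn) rather than only the final token.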
4. User Feedback Integration
User feedback is leveraged in two ways:
Online: mis‑rejection feedback (user repeats the command) triggers dynamic strategy adjustment.
Offline: both mis‑rejection and false‑accept feedback are mined to retrain models.
Personalized strategies consider historical user behavior to reduce unnecessary rejections.
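The online mis-rejection signal can be sketched as a retry heuristic: if the user repeats a just-rejected query within a short window, the rejection was likely wrong and the retry should be accepted. The window length and exact-match rule here are assumptions, not the talk's stated strategy.

```python
from datetime import datetime, timedelta

RETRY_WINDOW = timedelta(seconds=10)  # assumed retry window

def should_override_rejection(history, query, now):
    """history: list of (query, timestamp, was_rejected) tuples, oldest first."""
    for q, ts, rejected in reversed(history):
        if now - ts > RETRY_WINDOW:
            break  # older events are outside the retry window
        if rejected and q == query:
            return True  # user is repeating a rejected command: accept it
    return False

t0 = datetime(2023, 1, 1, 12, 0, 0)
history = [("turn on the light", t0, True)]
assert should_override_rejection(history, "turn on the light", t0 + timedelta(seconds=5))
```

The same events, logged offline, would feed the retraining loop described above as mis-rejection labels.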
5. Evaluation and Current Capability
Non‑human interaction detection achieves ~90% precision and 70%–80% recall, still below the desired 90%/90% balance. Human annotators show similar variability, indicating the intrinsic difficulty of the task.
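To make the precision/recall trade-off concrete: at 90% precision and 75% recall, out of 1,000 true non-human requests the system rejects 750, and those rejections come bundled with roughly 83 false rejections of genuine commands (750 × (1 − 0.9) / 0.9). The counts below are illustrative, not figures from the talk.

```python
def precision_recall(tp, fp, fn):
    """Standard definitions: precision = tp/(tp+fp), recall = tp/(tp+fn)."""
    return tp / (tp + fp), tp / (tp + fn)

# Illustrative counts consistent with ~90% precision, 75% recall.
tp, fp, fn = 750, 83, 250
p, r = precision_recall(tp, fp, fn)
```

This is why closing the recall gap to 90% without sacrificing precision is hard: every extra true rejection recovered tends to bring additional false rejections with it.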
Conclusion
The talk summarized two core ML tasks in voice assistants: non‑human interaction detection and ambiguous‑intent detection, describing data pipelines, model designs, experimental results, and future research directions.
Source: DataFunSummit, the official account of the DataFun community.