Detecting Invalid Queries in Voice Interaction: Non‑Human Interaction and Ambiguous Intent Recognition
This talk presents a comprehensive study of invalid query detection in voice assistants, covering the definition and taxonomy of invalid queries, challenges of non‑human interaction and ambiguous intent recognition, data collection and labeling strategies, feature engineering, deep neural network modeling, experimental results, user‑feedback loops, and current performance limits.
The presentation introduces the problem of invalid queries in voice interaction systems, distinguishing between effective queries that can be satisfied and invalid queries that cannot, which include non‑human interaction queries and ambiguous‑intent queries.
Effective queries are further categorized into single‑turn clear intent, scene‑aware intent, and multi‑intent cases, while invalid queries are split into non‑human interaction (e.g., background speech captured by always‑on microphones) and ambiguous intent (e.g., scrambled, incomplete, or vague utterances).
Statistics from Xiaomi’s devices show that non‑human interaction and ambiguous intent together account for 5%‑20% of all requests, highlighting the importance of improving the user experience for these cases.
For non‑human interaction detection, the task is framed as a single‑turn binary classification problem. Challenges include incomplete information, diverse acoustic variations, and the cocktail‑party effect. Modeling experiments reveal that relying solely on text is insufficient; acoustic features are crucial.
Data collection is costly because it requires manual audio annotation. Quality is improved through detailed labeling guidelines, multi‑annotator verification, and iterative data refinement. Sample mining strategies focus on increasing diversity (random sampling) and effectiveness (targeted mining of rare non‑human samples using ASR confidence or wake‑word detection).
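The two sampling strategies above can be sketched as a simple mining function: random sampling for diversity plus targeted mining of low-ASR-confidence requests, which often correlate with background speech. All field names, the confidence threshold, and the batch sizes are illustrative assumptions, not details from the talk.

```python
import random

def mine_candidates(logs, n_random=2, asr_conf_threshold=0.5, seed=0):
    """Mix random samples (diversity) with low-confidence samples
    (targeted mining of likely non-human audio) for annotation."""
    rng = random.Random(seed)
    diverse = rng.sample(logs, min(n_random, len(logs)))
    # Targeted mining: low ASR confidence often indicates background
    # speech or other non-human audio worth labeling.
    targeted = [q for q in logs if q["asr_conf"] < asr_conf_threshold]
    # Deduplicate by request id while preserving order.
    seen, batch = set(), []
    for q in diverse + targeted:
        if q["id"] not in seen:
            seen.add(q["id"])
            batch.append(q)
    return batch

logs = [
    {"id": 1, "text": "play music", "asr_conf": 0.95},
    {"id": 2, "text": "uh th mm",   "asr_conf": 0.30},
    {"id": 3, "text": "set alarm",  "asr_conf": 0.88},
]
batch = mine_candidates(logs)
```

In practice the mined batch would go through the multi-annotator verification process described above before entering the training set.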
The final model combines four feature groups: raw acoustic spectrograms processed by a CNN‑LSTM‑Attention encoder, text embeddings from the ASR transcript encoded by a TextCNN, and two high‑level features (ASR confidence scores and NLU‑derived intent/slot information). The fused representation feeds a classification layer with separate heads for first‑turn and subsequent‑turn queries.
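A framework-free sketch of the fusion step may clarify the architecture: the four feature groups are reduced to vectors (the real acoustic and text encoders are stubbed out here), concatenated, and fed to a turn-specific classification head. All dimensions, weights, and the `heads` structure are illustrative assumptions.

```python
import math

def fuse_features(acoustic_vec, text_vec, asr_conf, nlu_vec):
    # Simple concatenation of the four feature groups, which the talk
    # reports worked better than attention-based fusion for this task.
    return acoustic_vec + text_vec + [asr_conf] + nlu_vec

def classify(fused, weights, bias):
    # One linear head with a sigmoid output (probability of "invalid").
    score = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-score))

def predict(features, heads, is_first_turn):
    # Separate heads for first-turn vs subsequent-turn queries, since
    # the two distributions differ.
    w, b = heads["first" if is_first_turn else "subsequent"]
    return classify(fuse_features(*features), w, b)

# Toy feature vectors standing in for the CNN-LSTM-Attention and
# TextCNN encoder outputs; zero-weight heads for demonstration only.
features = ([0.1, 0.2], [0.3, 0.4], 0.9, [1.0])
heads = {"first": ([0.0] * 6, 0.0), "subsequent": ([0.0] * 6, 0.0)}
p = predict(features, heads, is_first_turn=True)
```

With untrained zero weights the head outputs 0.5; real weights would come from training on the annotated data described earlier.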
Extensive ablation studies show that raw spectrograms outperform MFCC/fbank features, deeper CNN stacks improve robustness, and simple concatenation of acoustic and textual embeddings outperforms attention‑based fusion. The model achieves roughly 90% precision and 70%‑80% recall, still short of the 90%‑precision/90%‑recall target common to many NLU tasks.
Ambiguous‑intent detection is tackled by three sub‑tasks: scrambled‑nonsense detection, incomplete‑utterance detection, and intent‑vague detection. Scrambled detection uses language‑model perplexity (LSTM → BERT → GPT‑style autoregressive models) with large training corpora and sizable models. Boundary cases are handled by a secondary classifier that incorporates richer features.
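The perplexity filter above can be illustrated with a short sketch. It assumes per-token log-probabilities have already been obtained from an autoregressive LM (the LM call itself is stubbed out); the threshold value is an illustrative assumption, not a figure from the talk.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-probability) over the token sequence."""
    avg = sum(token_logprobs) / len(token_logprobs)
    return math.exp(-avg)

def is_scrambled(token_logprobs, threshold=100.0):
    # High perplexity means the LM finds the utterance improbable,
    # i.e. likely scrambled nonsense. Boundary cases near the threshold
    # are routed to the richer second-stage classifier in the full system.
    return perplexity(token_logprobs) > threshold

fluent = [-1.2, -0.8, -1.5]   # a plausible token sequence
garbage = [-7.5, -8.1, -6.9]  # an improbable token sequence
```

The design choice here mirrors the talk: a cheap LM score handles the bulk of traffic, and only ambiguous boundary cases pay for the heavier classifier.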
Incomplete‑utterance detection is approached both as a language‑model next‑token prediction problem and as a binary classifier that can leverage contextual dialogue history. Multi‑turn modeling uses BERT‑based sentence‑pair classification to decide whether a current utterance is complete given the previous turn.
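For the multi-turn case, a BERT-style sentence-pair input pairs the previous turn (segment A) with the current utterance (segment B). The sketch below shows only the input construction, following the standard BERT `[CLS]`/`[SEP]` convention; the classifier and any tokenizer details are out of scope and the example utterances are invented.

```python
def build_pair_input(prev_turn, current_utterance):
    """Build a BERT-style sentence-pair string: the model then decides
    whether the current utterance is complete given the previous turn."""
    return f"[CLS] {prev_turn} [SEP] {current_utterance} [SEP]"

x = build_pair_input("set an alarm for", "seven in the morning")
```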
User feedback (mis‑reject and over‑accept signals) is incorporated both online (dynamic strategy adjustment) and offline (feedback‑driven data augmentation) to further refine the models, with personalized strategies that consider historical user behavior.
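A hedged sketch of the online half of this loop: if users frequently re-ask after a rejection (a mis-reject signal), the rejection threshold is relaxed; if they frequently abandon accepted queries (an over-accept signal), it is tightened. The target rate, step size, and bounds are all assumptions for illustration.

```python
def adjust_threshold(threshold, mis_reject_rate, over_accept_rate,
                     target=0.05, step=0.01, lo=0.1, hi=0.9):
    """Nudge the rejection threshold toward balanced feedback rates."""
    if mis_reject_rate > target:
        threshold -= step  # too many mis-rejects: reject less aggressively
    if over_accept_rate > target:
        threshold += step  # too many over-accepts: reject more aggressively
    return min(hi, max(lo, threshold))

t = adjust_threshold(0.5, mis_reject_rate=0.10, over_accept_rate=0.01)
```

Per-user variants of this rule (e.g. user-specific `target` rates derived from historical behavior) would correspond to the personalized strategies mentioned above.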
The overall system demonstrates that deep neural networks leveraging both acoustic and textual cues can approach human‑level performance on non‑human interaction detection, while ambiguous‑intent detection remains a high‑ambiguity challenge requiring continued research.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.