Detecting Invalid Queries in Voice Interaction: Non‑Human Interaction and Ambiguous Intent Recognition
This talk presents a comprehensive study of invalid query detection in voice assistants, covering the definition of effective and ineffective queries, challenges of non‑human interaction and ambiguous intent recognition, data collection, model design, experimental results, user‑feedback loops, and future research directions.
Speaker: Cui Shiqi, Head of Continuous Dialogue Algorithm at Xiaomi AI.
Introduction: The presentation focuses on recognizing invalid queries in voice interaction systems, which are divided into effective queries (fulfillable) and invalid queries (unfulfillable).
1. Invalid Query Overview
Invalid queries are categorized into two main types: non‑human interaction queries (background speech captured by the device) and ambiguous‑intent queries (the system cannot determine the user’s intention).
1.1 Effective Query Types
Single‑turn clear intent: the current utterance directly reveals the user’s goal (e.g., "turn on the living‑room AC").
Context‑clear intent: the surrounding context (scene, device state, user history) disambiguates the intent.
Multiple intents: the utterance matches several possible intents without sufficient context.
1.2 Invalid Query Types
Non‑human interaction: speech from bystanders or ambient noise mistakenly captured as a command.
Ambiguous intent: includes (a) scrambled/no‑meaning utterances caused by ASR errors, (b) incomplete expressions due to truncation, and (c) vague queries lacking clear intent words.
These invalid queries account for 5%–20% of all requests across Xiaomi devices, making their mitigation important.
2. Non‑Human Interaction Recognition
2.1 Challenges
Information‑incomplete ML task: requires acoustic, prosodic, and possibly visual cues, but most systems rely only on audio.
Variability of speech: tone, speed, background noise, and speaker differences hinder generalization.
Cocktail‑party effect: distinguishing the target speaker in noisy environments is difficult for machines.
2.2 Problem Modeling
Pure NLU approaches using only the query text perform poorly; audio signals are essential. Adding previous turn information yields little gain because non‑human queries are largely independent across turns.
2.3 Solution
The task is framed as a single‑turn binary classification problem. Key steps include dataset construction and feature/model selection.
2.3.1 Dataset Construction
High annotation cost: audio labeling requires skilled annotators; building a 100k‑sample set needs ~100 person‑days.
Quality improvement: detailed labeling guidelines, multi‑annotator verification, and iterative data cleaning raise label consistency.
Sample mining: random sampling for diversity, targeted mining for positive (non‑human) samples using ASR confidence or wake‑word detection, and hard‑sample mining via model scores and user feedback.
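The hard-sample mining step above can be sketched as follows. This is a minimal illustration, not the talk's actual pipeline: the idea is that queries whose model score falls near the decision boundary are the most informative to route to annotators. The score band and the example queries are assumptions.

```python
def mine_hard_samples(scored_queries, low=0.4, high=0.6):
    """Return queries whose non-human score lies in the ambiguous band [low, high]."""
    return [(q, s) for q, s in scored_queries if low <= s <= high]

# Illustrative (query, model score) pairs; scores are made up.
scored = [
    ("turn on the living-room AC", 0.05),  # confidently human-directed
    ("uh the weather um", 0.55),           # ambiguous: worth annotating
    ("<background TV audio>", 0.97),       # confidently non-human
]
hard = mine_hard_samples(scored)
```

In practice the band would be tuned against annotation budget: a wider band yields more labels but lower information density per label.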
2.3.2 Model Architecture
A deep neural network combines four feature groups:
Speech features: raw audio processed into spectrograms (frame, window, FFT) fed to a CNN‑LSTM‑Attention encoder.
Text features: word embeddings of the ASR transcript encoded by a TextCNN.
High‑level acoustic features: ASR confidence scores.
High‑level semantic features: NLU‑extracted domain intents and slots.
The speech encoder uses multi‑layer CNNs, followed by LSTM and Attention; the text encoder uses TextCNN. Outputs are concatenated with high‑level features and passed to separate classification heads for first‑turn and non‑first‑turn queries.
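The spectrogram front end mentioned above (frame, window, FFT) can be sketched with NumPy. The frame length, hop size, and FFT size below are common defaults for 16 kHz audio and are assumptions, not the talk's stated configuration; the encoder itself (CNN-LSTM-Attention) is omitted.

```python
import numpy as np

def spectrogram(signal, frame_len=400, hop=160, n_fft=512):
    """Slice a waveform into overlapping windowed frames and take |FFT| per frame."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)  # taper each frame to reduce spectral leakage
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft pads each 400-sample frame to n_fft and keeps the positive-frequency bins
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))  # shape (n_frames, n_fft//2 + 1)

# One second of 16 kHz audio; random noise stands in for a real recording.
audio = np.random.randn(16000)
spec = spectrogram(audio)  # shape (98, 257)
```

The resulting time-frequency matrix is what a CNN encoder would consume as a 2-D input, which matches the finding that spectrograms outperformed MFCC/fbank here.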
2.4 Experimental Findings
Spectrogram features outperform MFCC/fbank for this task.
Deeper CNN stacks (more convolutional layers) improve robustness; adding LSTM+Attention on top yields further gains.
TextCNN performs on par with Transformer/BERT for this classification.
Simple concatenation of speech and text embeddings outperforms attention‑based fusion.
3. Ambiguous‑Intent Recognition
3.1 Types of Ambiguity
Scrambled/no‑meaning queries: high perplexity sentences detected via language‑model perplexity.
Incomplete expressions: sentences cut off or missing final tokens.
Vague intent: queries lacking clear intent words.
3.2 Scrambled Query Detection
Perplexity is computed using large‑scale language models (LSTM, BERT, or GPT‑style autoregressive models). BERT‑based perplexity is accurate but computationally heavy; GPT‑style models reduce inference cost while maintaining performance.
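The perplexity idea can be shown with a toy autoregressive bigram model in place of the large LSTM/GPT-style LMs the talk describes; the tiny corpus and add-one smoothing are purely illustrative assumptions. A scrambled query assigns low probability to its token sequence and therefore scores high perplexity.

```python
import math

# Build bigram counts from a tiny stand-in corpus.
corpus = ["turn on the light", "turn off the light", "play some music"]
vocab = {w for s in corpus for w in s.split()} | {"<s>"}
counts, context = {}, {}
for s in corpus:
    toks = ["<s>"] + s.split()
    for a, b in zip(toks, toks[1:]):
        counts[(a, b)] = counts.get((a, b), 0) + 1
        context[a] = context.get(a, 0) + 1

def perplexity(sentence):
    """Per-token perplexity under the add-one-smoothed bigram model."""
    toks = ["<s>"] + sentence.split()
    logp = 0.0
    for a, b in zip(toks, toks[1:]):
        p = (counts.get((a, b), 0) + 1) / (context.get(a, 0) + len(vocab))
        logp += math.log(p)
    return math.exp(-logp / (len(toks) - 1))
```

A fluent query like "turn on the light" scores markedly lower perplexity than its scrambled permutation "light the on turn", which is the signal thresholded to flag no-meaning queries.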
3.3 Incomplete Expression Detection
Two approaches are explored:
Predict the next token as an end‑of‑sentence marker using a language model.
Binary classification using query text and auxiliary features (e.g., bigram, mutual information) to handle colloquial and truncated utterances.
Both single‑turn and multi‑turn models are employed: the single‑turn model judges completeness directly, while the multi‑turn model re‑evaluates using the previous turn context.
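The first approach above, predicting whether the next token is an end-of-sentence marker, reduces to thresholding P(EOS | prefix). The probability table below is a made-up illustration standing in for real language-model output, and the threshold is an assumption.

```python
# Hypothetical P(</s> | last token) values a trained LM might produce.
eos_prob = {
    "light": 0.72,  # "turn on the light" plausibly ends here
    "the": 0.01,    # "turn on the" is almost certainly truncated
}

def is_incomplete(query, threshold=0.2):
    """Flag a query as truncated when the LM's end-of-sentence probability is low."""
    last = query.split()[-1]
    return eos_prob.get(last, 0.5) < threshold

assert is_incomplete("turn on the")
assert not is_incomplete("turn on the light")
```

A real system would condition on the full prefix (and, per the multi-turn model, on the previous turn) rather than only the final token.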
4. User Feedback Integration
User feedback is leveraged in two ways:
Online: mis‑rejection feedback (user repeats the command) triggers dynamic strategy adjustment.
Offline: both mis‑rejection and false‑accept feedback are mined to retrain models.
Personalized strategies consider historical user behavior to reduce unnecessary rejections.
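The online mis-rejection signal can be sketched as a retry heuristic: if the user repeats a just-rejected query within a short window, the rejection was likely wrong and the retry should be accepted. The window length and exact-match rule here are assumptions, not the talk's stated strategy.

```python
from datetime import datetime, timedelta

RETRY_WINDOW = timedelta(seconds=10)  # assumed retry window

def should_override_rejection(history, query, now):
    """history: list of (query, timestamp, was_rejected) tuples, oldest first."""
    for q, ts, rejected in reversed(history):
        if now - ts > RETRY_WINDOW:
            break  # older events are outside the retry window
        if rejected and q == query:
            return True  # user is repeating a rejected command: accept it
    return False

t0 = datetime(2023, 1, 1, 12, 0, 0)
history = [("turn on the light", t0, True)]
assert should_override_rejection(history, "turn on the light", t0 + timedelta(seconds=5))
```

The same events, logged offline, would feed the retraining loop described above as mis-rejection labels.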
5. Evaluation and Current Capability
Non‑human interaction detection achieves ~90% precision and 70%–80% recall, still below the desired 90%/90% balance. Human annotators show similar variability, indicating the intrinsic difficulty of the task.
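To make the precision/recall trade-off concrete: at 90% precision and 75% recall, out of 1,000 true non-human requests the system rejects 750, and those rejections come bundled with roughly 83 false rejections of genuine commands (750 × (1 − 0.9) / 0.9). The counts below are illustrative, not figures from the talk.

```python
def precision_recall(tp, fp, fn):
    """Standard definitions: precision = tp/(tp+fp), recall = tp/(tp+fn)."""
    return tp / (tp + fp), tp / (tp + fn)

# Illustrative counts consistent with ~90% precision, 75% recall.
tp, fp, fn = 750, 83, 250
p, r = precision_recall(tp, fp, fn)
```

This is why closing the recall gap to 90% without sacrificing precision is hard: every extra true rejection recovered tends to bring additional false rejections with it.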
Conclusion
The talk summarized two core ML tasks in voice assistants: non‑human interaction detection and ambiguous‑intent detection, describing data pipelines, model designs, experimental results, and future research directions.
Source: DataFunSummit, the official account of the DataFun community.