
Voice and Language Technologies in Natural Interaction: iQIYI HomeAI Speech Interaction System

The talk introduced iQIYI’s HomeAI platform, which combines user profiling (including voiceprint and age detection) with automatic video semantic extraction to enable natural, multi‑turn voice‑based video search—addressing hot‑content updates, contextual awareness, device environments, and personalized recommendations for screen‑less or accessibility‑focused users.

iQIYI Technical Product Team

This article reports on a technical talk from the iQIYI Technology Salon titled “Practice of Voice and Language Technologies in Natural Interaction.” The speaker, iQIYI researcher Shane Wang, introduced the HomeAI intelligent voice interaction platform and shared practical experiences in applying voice technology to video search.

The presentation was divided into five parts: (1) the application scenarios of HomeAI; (2) differences between voice‑based video search and conventional text‑based search; (3) support for newly released (“hot”) content; (4) contextual and user‑environment considerations; and (5) the synergy between HomeAI and video content understanding.

HomeAI is built on two pillars—user profiling and video content analysis. On the user side, standard speech‑recognition and intent‑understanding pipelines are enriched with advanced research such as age detection, voiceprint extraction, and voiceprint matching to obtain personalized user attributes.

On the video side, AI replaces manual tags with automatic extraction of actors, dialogues, scenes, actions, etc., enabling a semantic understanding of video assets that can be exposed to higher‑level business logic.

Three typical situations motivate voice‑based video search: (a) devices without screens (e.g., smart speakers); (b) inconvenient text input on large screens (e.g., smart TVs); and (c) special user groups such as children or the elderly.

The speaker highlighted three key differences between voice and text search: ambiguity in program names, lack of visible classification in voice UI, and the importance of context across multi‑turn interactions.

The standard voice search pipeline consists of ASR → intent extraction → structured query → video‑library retrieval. Challenges arise with hot content because ASR often misrecognizes newly released titles and homophones, requiring frequent updates to language models and entity libraries.
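The pipeline above can be sketched end to end. This is a deliberately minimal illustration, not iQIYI's implementation: the function names, the keyword-based intent extraction, and the tiny in-memory video library are all invented for the example.

```python
# Toy sketch of: ASR -> intent extraction -> structured query -> retrieval.
# Every name and the two-entry "library" below are illustrative assumptions.

VIDEO_LIBRARY = [
    {"title": "The Longest Day in Chang'an", "actor": "Lei Jiayin", "type": "drama"},
    {"title": "The Wandering Earth", "actor": "Wu Jing", "type": "movie"},
]

def asr(audio: bytes) -> str:
    """Stand-in for the speech recognizer; returns a fixed transcript."""
    return "play a movie with Wu Jing"

def extract_intent(text: str) -> dict:
    """Naive keyword matching; real systems use trained NLU models."""
    query = {"action": "play"}
    for video in VIDEO_LIBRARY:
        if video["actor"].lower() in text.lower():
            query["actor"] = video["actor"]
    if "movie" in text:
        query["type"] = "movie"
    return query

def retrieve(query: dict) -> list:
    """Filter the library on every structured field except the action."""
    fields = {k: v for k, v in query.items() if k != "action"}
    return [v for v in VIDEO_LIBRARY
            if all(v.get(k) == val for k, val in fields.items())]

results = retrieve(extract_intent(asr(b"...")))
```

In a production system each stage is a separate service, and the structured query carries many more slots (genre, year, episode), but the data flow is the same.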

To keep the entity library up‑to‑date, iQIYI continuously ingests recent textual corpora, synthesizes them into a base language model, and merges a video‑domain‑specific model. Entity vectors are injected into the model so that newly added entities (e.g., new drama titles, actors) are quickly recognized without degrading existing capabilities.
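The model-merging step can be pictured as interpolating a base language model with the video-domain model so that a freshly ingested title gets probability mass without retraining the base. The unigram probabilities, the interpolation weight, and the placeholder title below are assumptions for the sketch; the real system works at far larger scale with entity vectors rather than plain unigrams.

```python
# Hedged sketch: linear interpolation of a base LM with a video-domain LM,
# so hot content ("NewDramaTitle", a made-up placeholder) becomes
# recognizable while existing vocabulary keeps most of its probability.

def interpolate(base_lm: dict, domain_lm: dict, weight: float = 0.3) -> dict:
    """Mix two unigram models: (1 - weight) * base + weight * domain."""
    vocab = set(base_lm) | set(domain_lm)
    return {w: (1 - weight) * base_lm.get(w, 0.0) + weight * domain_lm.get(w, 0.0)
            for w in vocab}

base_lm = {"play": 0.5, "movie": 0.5}
domain_lm = {"movie": 0.4, "NewDramaTitle": 0.6}  # freshly ingested hot content

merged = interpolate(base_lm, domain_lm)
```

The key property is the one the talk emphasized: new entities appear in the merged model immediately, and existing words are only dampened, not removed, so prior capabilities are preserved.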

Intent recognition is enhanced by two additional vector streams: an acoustic vector that captures pronunciation so the model can tolerate ASR errors, and an entity-type vector that helps the model disambiguate unknown words by their semantic class (actor, title, location, etc.).
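One simple way to combine such streams is per-token feature concatenation: the word embedding, the acoustic vector, and a one-hot entity-type vector are joined into a single input for the intent classifier. The dimensions, type inventory, and combination scheme below are assumptions for illustration, not the model described in the talk.

```python
# Minimal sketch of combining three feature streams for intent recognition.
# The 2-d word vector, 3-d acoustic vector, and 4-way type inventory are
# arbitrary illustrative choices.

ENTITY_TYPES = ["actor", "title", "location", "other"]

def entity_type_vector(entity_type: str) -> list:
    """One-hot encoding of the token's semantic class."""
    return [1.0 if t == entity_type else 0.0 for t in ENTITY_TYPES]

def token_features(word_vec: list, acoustic_vec: list, entity_type: str) -> list:
    """Concatenate the three streams into one classifier input vector."""
    return word_vec + acoustic_vec + entity_type_vector(entity_type)

feats = token_features([0.1, 0.2], [0.9, 0.8, 0.7], "title")
```

The acoustic component keeps homophones close together even when ASR picks the wrong surface form, and the type component lets the classifier treat any unseen drama title like other titles.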

Context handling is treated as a multi‑turn dialogue state machine. The system evaluates whether a follow‑up utterance is a refinement of the previous query or a new request, and it compares the combined‑context result with a single‑turn result to select the most reasonable answer.
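The refinement-versus-new-request decision can be sketched as retrieving under both interpretations and keeping the more plausible one. The merge rule and the "prefer a non-empty contextual result" heuristic below are simplifying assumptions; the real system compares the two candidates with richer scoring.

```python
# Hedged sketch of the multi-turn decision: merge the previous query with the
# follow-up, retrieve for both readings, and pick one. All data is invented.

def merge_queries(prev: dict, followup: dict) -> dict:
    merged = dict(prev)
    merged.update(followup)  # follow-up slots override earlier ones
    return merged

def choose_interpretation(prev, followup, retrieve):
    combined = retrieve(merge_queries(prev, followup))
    single = retrieve(followup)
    # Prefer the contextual reading when it still returns results.
    return ("refinement", combined) if combined else ("new_request", single)

LIBRARY = [{"actor": "Wu Jing", "type": "movie"},
           {"actor": "Wu Jing", "type": "drama"}]
retrieve = lambda q: [v for v in LIBRARY
                      if all(v.get(k) == x for k, x in q.items())]

# Turn 1: "shows with Wu Jing"  ->  Turn 2: "the drama one"
label, results = choose_interpretation({"actor": "Wu Jing"},
                                       {"type": "drama"}, retrieve)
```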

Result quality is assessed using prior probability (popularity of a video) and posterior probability (relevance to the user). The system prefers results that cover more user‑specified keywords while avoiding over‑coverage that adds no new information.
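A toy version of that ranking multiplies a prior (popularity) by a coverage term, with a penalty for over-coverage. The weights, the tag representation, and the penalty of 0.1 per extra tag are invented for the sketch.

```python
# Illustrative ranking: prior * (keyword coverage - penalty for tags the
# user never asked about). All numbers and tags are assumptions.

def score(candidate: dict, keywords: set, prior: float) -> float:
    covered = len(keywords & set(candidate["tags"]))
    extra = len(set(candidate["tags"]) - keywords)  # over-coverage adds noise
    coverage = covered / len(keywords) if keywords else 0.0
    return prior * (coverage - 0.1 * extra)

keywords = {"Wu Jing", "movie"}
a = {"tags": ["Wu Jing", "movie"]}                 # covers exactly what was asked
b = {"tags": ["Wu Jing", "movie", "war", "2019"]}  # over-covers

best = max([a, b], key=lambda c: score(c, keywords, prior=1.0))
```

With equal priors, the exact-coverage candidate wins, matching the stated preference for results that cover the user's keywords without adding uninformative extras.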

User attributes are further refined through voiceprint clustering, enabling the system to differentiate family members sharing the same device and to personalize recommendations based on individual viewing habits.
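Speaker assignment on a shared device can be sketched as nearest-centroid matching over voiceprint embeddings: an utterance's embedding is compared to each known family member's centroid by cosine similarity, with a threshold for unknown voices. The 3-d vectors and the 0.8 threshold are illustrative; production systems use learned speaker embeddings of much higher dimension.

```python
# Minimal sketch of voiceprint-based speaker assignment. Centroids, vectors,
# and the similarity threshold are all made-up illustrative values.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def assign_speaker(embedding, centroids, threshold=0.8):
    """Return the best-matching known speaker, or None for a new voice."""
    best = max(centroids, key=lambda name: cosine(embedding, centroids[name]))
    return best if cosine(embedding, centroids[best]) >= threshold else None

centroids = {"parent": [1.0, 0.0, 0.0], "child": [0.0, 1.0, 0.0]}
speaker = assign_speaker([0.9, 0.1, 0.0], centroids)
```

Once utterances are attributed to individuals, each person's viewing history can drive separate recommendations even though the device account is shared.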

Device environment (type of device, UI state, playback progress) is also incorporated. UI tags can be injected or automatically parsed so that spoken commands can interact with on‑screen elements.
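Matching a spoken command against injected UI tags can be as simple as finding the on-screen element whose label the utterance mentions. The tag schema (`id`, `label`) and the substring-matching rule are assumptions for this sketch, not iQIYI's protocol.

```python
# Hedged sketch of voice control over on-screen elements via injected UI tags.
# The tag format and matching rule are illustrative assumptions.

def match_ui_element(command: str, ui_tags: list):
    """Pick the on-screen element whose label appears in the utterance."""
    for tag in ui_tags:
        if tag["label"].lower() in command.lower():
            return tag
    return None

ui_tags = [
    {"id": "btn_play", "label": "play"},
    {"id": "row_recommend", "label": "recommended for you"},
]
target = match_ui_element("open recommended for you", ui_tags)
```

A real implementation would fold in fuzzy matching (to survive ASR errors) and the current UI state, e.g. which row has focus.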

Finally, video semantic extraction is performed offline: basic tags (people, objects, scenes, actions, dialogue, BGM) are generated and stored in a database, then fed into the language model so that spoken queries can be matched to these tags. During playback, the system can answer context‑aware questions such as “Who is this?” or perform intelligent jumps based on the current timestamp.
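Answering "Who is this?" then reduces to a lookup of time-stamped tags at the current playback position. The tag table and interval format below are fabricated for the example; the article only states that basic tags are extracted offline and stored in a database.

```python
# Illustrative sketch: offline extraction yields time-stamped tags, and the
# playback timestamp selects who is on screen. The table is made up.

FACE_TAGS = [  # (start_sec, end_sec, person)
    (0, 120, "Actor A"),
    (120, 300, "Actor B"),
]

def who_is_this(timestamp: float):
    """Return the person tagged on screen at the given playback position."""
    for start, end, person in FACE_TAGS:
        if start <= timestamp < end:
            return person
    return None

answer = who_is_this(150.0)
```

The same time-indexed tags support the "intelligent jump" use case: a query like "skip to the fight scene" resolves to the start timestamp of the matching action tag.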

Tags: AI, Natural Language Processing, context-aware, Speech Recognition, entity extraction, video search, voice interaction
Written by

iQIYI Technical Product Team

The technical product team of iQIYI
