Multi-level Multi-modal Search Engine and Graph Engine for Billion-scale Video Content

An advanced multi‑level, multi‑modal search and graph engine for Youku processes text, voice, image and video queries across hierarchical video elements, using combined vector and inverted indexes to merge cross‑level and cross‑modal results, while a distributed knowledge‑graph layer enables multimodal graph traversal for billion‑scale video retrieval.

Youku Technology

Overview : The presentation introduces a specialized multi-level, multi-modal retrieval engine designed for recommendation and search in the video domain, particularly for Youku. It covers the evolution from text‑based queries to rich multimodal inputs such as images, video clips, and voice.

Business Context : Youku video search runs on multiple client platforms (app, web, TV, OTT, mini‑programs) and adapts the result layout to user intent (program‑centric queries, generic queries, UGC, celebrity). On large‑screen devices it also supports multi‑round dialogue.

User Interaction Modes : Text search, alphabetic search on TV, voice search, and emerging multimodal search (image/video capture). A face‑search feature allows users to upload or select a photo to find similar celebrity faces.

Multi-level Video Elements : Content is organized hierarchically – macro level (channels, programs) and micro level (frames, regions, objects, narrative scenes, stories). AI models extract these elements to build fine‑grained video indexes.
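The hierarchy described above can be pictured as a tree of elements, each carrying a level name and a time span. The following is a minimal sketch only; the class name `Element`, its fields, and the level names are illustrative assumptions, not Youku's actual schema.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Element:
    """One node in the video hierarchy (hypothetical structure)."""
    element_id: str
    level: str                 # assumed levels: "program", "video", "scene", "frame"
    label: str = ""
    start_sec: float = 0.0
    end_sec: float = 0.0
    children: List["Element"] = field(default_factory=list)

    def walk(self) -> Iterator["Element"]:
        """Yield this element and all of its descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


# Build a tiny macro-to-micro hierarchy.
frame = Element("f1", "frame", "dancing", 12.0, 12.04)
scene = Element("s1", "scene", "stage performance", 10.0, 35.0, [frame])
video = Element("v1", "video", "episode 1", 0.0, 2700.0, [scene])
program = Element("p1", "program", "variety show", children=[video])

# Flatten the tree into rows that a fine-grained index could ingest.
index_rows = [(e.level, e.element_id, e.label) for e in program.walk()]
```

Flattening the tree this way keeps each micro-level element addressable on its own while its position in the tree still records which video and program it belongs to.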

Multimodal Input Pipeline : Images or videos are compressed and cropped on the client, uploaded via the AUS service, and processed by an online image computation service (face detection, vectorization). Features are then used for vector and inverted‑index retrieval.

Engine Design : The index structure consists of three levels, combining vector indexes and inverted indexes for each video. Cross‑level retrieval merges results from different granularities, while cross‑modal retrieval bridges text and vector spaces (e.g., face vectors ↔ name texts).
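As a rough illustration of cross‑modal retrieval, the sketch below pairs a toy inverted index over text with a brute‑force vector lookup, so that a face vector is first mapped to a name and the name is then used as a text query. The class `HybridIndex`, its methods, and the sample data are assumptions made for this example; a production engine would use an approximate nearest‑neighbor index rather than a linear scan.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class HybridIndex:
    """Toy index with an inverted (text) side and a vector (face) side."""

    def __init__(self):
        self.postings = defaultdict(set)   # term -> set of document ids
        self.face_vectors = {}             # person name -> face vector

    def add_doc(self, doc_id, text):
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def add_face(self, name, vector):
        self.face_vectors[name] = vector

    def search_text(self, text):
        """Return documents containing every query term (AND semantics)."""
        term_sets = [self.postings.get(t, set()) for t in text.lower().split()]
        return set.intersection(*term_sets) if term_sets else set()

    def search_by_face(self, query_vec):
        """Vector space -> name text -> inverted index."""
        name = max(self.face_vectors,
                   key=lambda n: cosine(query_vec, self.face_vectors[n]))
        return name, self.search_text(name)


index = HybridIndex()
index.add_doc("clip_1", "alice dancing on stage")
index.add_doc("clip_2", "bob interview")
index.add_face("alice", [0.9, 0.1])
index.add_face("bob", [0.1, 0.9])

matched_name, matched_docs = index.search_by_face([0.8, 0.2])
```

Here the query vector is closest to the stored vector for "alice", so the text side is queried with that name and returns the clip whose text mentions it.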

Key Engine Steps : (1) Multi‑level merge – automatic traversal across levels to form a retrieval path; (2) Top‑level multi‑level ranking – final sorting based on combined recall information.
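One way to picture the two steps is to roll hits from every level up to their parent video, then sort the merged entries by a combined score. This is a sketch under assumptions: the per‑level weights, the hit fields, and the additive scoring rule are invented for illustration and are not taken from the presentation.

```python
from collections import defaultdict

# Assumed weights per level; a real system would likely learn these.
LEVEL_WEIGHT = {"program": 1.0, "video": 0.8, "scene": 0.6}


def merge_and_rank(hits, top_k=10):
    """Merge hits from different levels by video, then rank by combined score.

    Each hit is a dict with keys: video_id, level, score, and optionally
    start_sec for micro-level hits.
    """
    merged = defaultdict(lambda: {"score": 0.0, "levels": set(), "timestamps": []})
    for hit in hits:
        entry = merged[hit["video_id"]]
        entry["score"] += LEVEL_WEIGHT.get(hit["level"], 0.5) * hit["score"]
        entry["levels"].add(hit["level"])
        if "start_sec" in hit:
            entry["timestamps"].append(hit["start_sec"])
    ranked = sorted(merged.items(), key=lambda kv: kv[1]["score"], reverse=True)
    return ranked[:top_k]


hits = [
    {"video_id": "v1", "level": "video", "score": 0.5},
    {"video_id": "v1", "level": "scene", "score": 0.9, "start_sec": 42.0},
    {"video_id": "v2", "level": "program", "score": 0.7},
]
results = merge_and_rank(hits)
```

Because the scene‑level hit keeps its timestamp through the merge, the final result for a video can still point the user to the matching moment inside it.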

Applications : Consumer‑facing (C‑end) examples include searching for “Yi Yangqianxi dancing” and receiving relevant video clips with timestamps. Business‑facing (B‑end) use cases cover searching for specific shots or dialogue lines within a series (e.g., clips from “Long‑march Twelve Hours”).

Multimodal Graph Engine : Extends the search engine with a knowledge graph for the entertainment industry. It leverages Apache TinkerPop/Gremlin for graph traversal, supporting both traditional graph queries and multimodal vector queries. Example scenario: finding dramas featuring actresses who resemble a given celebrity.
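The example scenario combines a vector step (find people whose faces resemble a seed) with a graph step (follow "acted in" edges to dramas). The sketch below mimics that flow with plain in‑memory Python; it does not use the Gremlin API, and the names, vectors, edges, and similarity threshold are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical vertex property: a face vector per person.
face_vectors = {
    "celebrity_a": [1.0, 0.0],
    "actress_b": [0.95, 0.1],
    "actress_c": [0.0, 1.0],
}

# Hypothetical "acted_in" edges: person -> dramas.
acted_in = {
    "actress_b": ["drama_x", "drama_y"],
    "actress_c": ["drama_z"],
}


def similar_people(seed, threshold=0.9):
    """Vector step: people whose face vector is close to the seed's."""
    seed_vec = face_vectors[seed]
    return [name for name, vec in face_vectors.items()
            if name != seed and cosine(seed_vec, vec) >= threshold]


def dramas_with_lookalikes(seed, threshold=0.9):
    """Graph step: follow acted_in edges from each lookalike to dramas."""
    results = {}
    for person in similar_people(seed, threshold):
        for drama in acted_in.get(person, []):
            results.setdefault(drama, []).append(person)
    return results


lookalike_dramas = dramas_with_lookalikes("celebrity_a")
```

In a Gremlin‑based engine the second function would correspond to an ordinary edge traversal, with the vector step acting as a custom source of starting vertices.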

Graph Engine Indexing : Distributed vertex/edge indexes combine KV, KKV, inverted, and vector structures to enable efficient multi‑round graph traversal and model‑in‑the‑loop scoring.
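To make the KV and KKV roles concrete, the sketch below assumes that a KV index maps a vertex id to its properties and that a KKV index maps a source vertex id to a second key (the neighbor id) and then to edge properties. That reading of "KKV" as key‑key‑value, along with the class and function names, is an assumption made for illustration; the multi‑round traversal is a plain breadth‑first expansion.

```python
from collections import defaultdict


class KVIndex:
    """Vertex store: vertex id -> properties."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class KKVIndex:
    """Edge store: source id -> neighbor id -> edge properties."""

    def __init__(self):
        self._data = defaultdict(dict)

    def put(self, primary_key, secondary_key, value):
        self._data[primary_key][secondary_key] = value

    def get(self, primary_key):
        """Return all (neighbor id -> edge properties) for one source vertex."""
        return dict(self._data.get(primary_key, {}))


def expand(edges, start, hops):
    """Multi-round traversal: return vertices first reached at exactly `hops` rounds."""
    seen = set(start)
    frontier = set(start)
    for _ in range(hops):
        next_frontier = set()
        for vertex in frontier:
            for neighbor in edges.get(vertex):
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_frontier.add(neighbor)
        frontier = next_frontier
    return frontier


vertices = KVIndex()
vertices.put("actor_1", {"type": "person"})
vertices.put("drama_1", {"type": "drama"})
vertices.put("actor_2", {"type": "person"})

edges = KKVIndex()
edges.put("actor_1", "drama_1", {"label": "acted_in"})
edges.put("drama_1", "actor_2", {"label": "features"})

two_hop = expand(edges, {"actor_1"}, hops=2)
```

Keying edges by source vertex means each traversal round needs only one lookup per frontier vertex, which is the property that makes repeated expansion cheap.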

Future Outlook : As AI capabilities mature, richer knowledge bases and multimodal retrieval will further enhance user interaction across devices, presenting both challenges and opportunities for large‑scale video search.

AI, knowledge graph, video retrieval, large-scale indexing, graph engine, multimodal search