
Multi-level Multi-modal Search Engine and Graph Engine for Billion-scale Video Content

An advanced multi‑level, multi‑modal search and graph engine for Youku processes text, voice, image and video queries across hierarchical video elements, using combined vector and inverted indexes to merge cross‑level and cross‑modal results, while a distributed knowledge‑graph layer enables multimodal graph traversal for billion‑scale video retrieval.

Youku Technology

Jul 9, 2020


Overview : The presentation introduces a specialized multi-level, multi-modal retrieval engine designed for recommendation and search in the video domain, particularly for Youku. It covers the evolution from text‑based queries to rich multimodal inputs such as images, video clips, and voice.

Business Context : Youku video search runs on multiple clients (app, web, TV, OTT, mini‑programs) and adapts the result layout to the intent behind the query (a specific program, a generic query, UGC, or a celebrity). Large‑screen devices also support multi‑round dialogue.

User Interaction Modes : Users can search by text, by alphabetic input on TV, by voice, and increasingly by multimodal input such as a captured image or video clip. A face‑search feature lets users upload or select a photo to find celebrities with similar faces.

Multi-level Video Elements : Content is organized hierarchically – macro level (channels, programs) and micro level (frames, regions, objects, narrative scenes, stories). AI models extract these elements to build fine‑grained video indexes.
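One way to picture this hierarchy is as a tree of elements that is flattened into index documents, each remembering its ancestors. The sketch below is illustrative only: the `Element` schema, the level names, and the sample data are assumptions, not Youku's actual data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    """One node in the video hierarchy (hypothetical schema)."""
    element_id: str
    level: str            # e.g. "program", "scene", "frame", "object"
    label: str = ""
    start_s: float = 0.0  # offset inside the parent video, in seconds
    end_s: float = 0.0
    children: List["Element"] = field(default_factory=list)


def flatten(root, path=()):
    """Yield (path_of_ids, element) for every node, depth first."""
    here = path + (root.element_id,)
    yield here, root
    for child in root.children:
        yield from flatten(child, here)


# Made-up sample: one program with a scene, a frame, and an object.
program = Element("p1", "program", "Example Drama", children=[
    Element("p1.s1", "scene", "street chase", 120.0, 185.0, children=[
        Element("p1.s1.f1", "frame", "rooftop", 130.0, 130.0, children=[
            Element("p1.s1.f1.o1", "object", "lantern", 130.0, 130.0),
        ]),
    ]),
])

# One index document per element; "ancestors" lets a micro-level hit
# be traced back to the macro-level program it belongs to.
index_docs = [
    {"id": e.element_id, "level": e.level, "label": e.label,
     "ancestors": list(p[:-1])}
    for p, e in flatten(program)
]
```

Storing the ancestor path on every document is one simple way to let a frame-level or object-level hit be rolled up to its program at query time.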

Multimodal Input Pipeline : Images or videos are compressed and cropped on the client, uploaded via the AUS service, and processed by an online image computation service (face detection, vectorization). Features are then used for vector and inverted‑index retrieval.

Engine Design : The index structure consists of three levels, combining vector indexes and inverted indexes for each video. Cross‑level retrieval merges results from different granularities, while cross‑modal retrieval bridges text and vector spaces (e.g., face vectors ↔ name texts).
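The face vector ↔ name text bridge can be sketched as a two-step lookup: a vector recall resolves the query face to a name, and that name is then used as a term in the inverted index. This is a toy, in-memory illustration; the index contents, the 3-dimensional vectors, the `min_sim` threshold, and the function names are all assumptions rather than details from the talk.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


# Toy vector index: person name -> face embedding (made-up values).
face_index = {
    "actor_a": [0.9, 0.1, 0.0],
    "actor_b": [0.0, 1.0, 0.1],
}

# Toy inverted index: term -> set of video ids.
inverted = defaultdict(set)
for video_id, terms in {
    "v1": ["actor_a", "dance"],
    "v2": ["actor_b", "interview"],
    "v3": ["actor_a", "interview"],
}.items():
    for term in terms:
        inverted[term].add(video_id)


def search_by_face(query_vec, min_sim=0.8):
    """Vector recall -> name term -> inverted-index lookup."""
    name, sim = max(
        ((n, cosine(query_vec, v)) for n, v in face_index.items()),
        key=lambda pair: pair[1],
    )
    if sim < min_sim:
        return None, []  # no confident identity match
    return name, sorted(inverted.get(name, set()))


name, videos = search_by_face([0.8, 0.2, 0.0])
```

A production engine would use an approximate nearest-neighbour index rather than the brute-force scan shown here; the point is only the hand-off from vector space to text terms.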

Key Engine Steps : (1) Multi‑level merge: the engine automatically traverses the index levels to form a retrieval path and merges the hits recalled at each level; (2) Multi‑level ranking at the top level: the final sort uses the recall information combined from all levels.
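As a rough illustration of these two steps, the sketch below rolls hits from different levels up to their parent video and then sorts on a combined score. The weighted-sum formula, the `LEVEL_WEIGHT` values, and the hit tuples are assumptions made for the example; the talk does not specify how the engine actually scores.

```python
from collections import defaultdict

# Hypothetical per-level weights; not taken from the source.
LEVEL_WEIGHT = {"program": 1.0, "scene": 0.6, "frame": 0.3}

# Each hit: (video_id, level, recall_score, start_offset_seconds_or_None)
hits = [
    ("v1", "program", 0.5, None),
    ("v1", "scene", 0.9, 120.0),
    ("v2", "frame", 0.8, 42.0),
    ("v2", "scene", 0.5, 40.0),
    ("v3", "program", 0.7, None),
]


def merge_and_rank(hits):
    """Step 1: merge hits per video. Step 2: rank on the combined score."""
    merged = defaultdict(lambda: {"score": 0.0, "best": (0.0, None)})
    for video_id, level, score, start_s in hits:
        weighted = LEVEL_WEIGHT[level] * score
        entry = merged[video_id]
        entry["score"] += weighted
        # Keep the timestamp of the strongest micro-level hit for display.
        if start_s is not None and weighted > entry["best"][0]:
            entry["best"] = (weighted, start_s)
    ranked = sorted(merged.items(), key=lambda kv: kv[1]["score"], reverse=True)
    return [(vid, round(e["score"], 3), e["best"][1]) for vid, e in ranked]


results = merge_and_rank(hits)
```

Each result carries the video id, its combined score, and the offset of its best clip-level hit, which is the kind of information a result page needs to show a clip with a timestamp.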

Applications : Consumer‑facing (C‑end) examples include searching for “Yi Yangqianxi dancing” and receiving relevant video clips with timestamps. Business‑facing (B‑end) use cases cover searching for specific shots or dialogue lines within a series (e.g., clips from “The Longest Day in Chang'an”).

Multimodal Graph Engine : Extends the search engine with a knowledge graph for the entertainment industry. It leverages Apache TinkerPop/Gremlin for graph traversal, supporting both traditional graph queries and multimodal vector queries. Example scenario: finding dramas featuring actresses who resemble a given celebrity.
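The example scenario mixes a vector step (find people whose faces resemble a given celebrity) with an ordinary graph hop (follow their acting credits to dramas). The sketch below mimics that in plain Python over a tiny in-memory graph. It does not use the Gremlin API, and the vertex names, edge label `acted_in`, and 2-dimensional face vectors are invented for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


# Vertices: id -> properties (made-up data).
vertices = {
    "celeb_x": {"type": "person", "face": [1.0, 0.0]},
    "actress_a": {"type": "person", "face": [0.95, 0.05]},
    "actress_b": {"type": "person", "face": [0.1, 0.9]},
    "drama_1": {"type": "drama"},
    "drama_2": {"type": "drama"},
}

# Edges: (source, label, target).
edges = [
    ("actress_a", "acted_in", "drama_1"),
    ("actress_b", "acted_in", "drama_2"),
    ("celeb_x", "acted_in", "drama_2"),
]


def similar_people(person_id, top_k=1):
    """Vector step: rank other person vertices by face similarity."""
    query = vertices[person_id]["face"]
    scored = [
        (cosine(query, props["face"]), vid)
        for vid, props in vertices.items()
        if props["type"] == "person" and vid != person_id
    ]
    scored.sort(reverse=True)
    return [vid for _, vid in scored[:top_k]]


def out(vertex_ids, label):
    """Graph step: follow outgoing edges with the given label."""
    wanted = set(vertex_ids)
    return sorted({dst for src, lbl, dst in edges
                   if src in wanted and lbl == label})


# Dramas featuring the person who most resembles celeb_x.
dramas = out(similar_people("celeb_x"), "acted_in")
```

In a Gremlin-based engine the second step would correspond to an `out('acted_in')` hop; how the vector step is exposed in the traversal language is not described in the source.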

Graph Engine Indexing : Distributed vertex/edge indexes combine KV, KKV, inverted, and vector structures to enable efficient multi‑round graph traversal and model‑in‑the‑loop scoring.
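A minimal sketch of how such structures might fit together, assuming KV means a vertex-id to properties map and KKV means a two-level map from source vertex to edge label to neighbours (the source does not define either term). The class and method names are hypothetical, and the vector index and model scoring are left out to keep the example short.

```python
from collections import defaultdict


class GraphIndex:
    """Toy single-process stand-in for distributed vertex/edge indexes."""

    def __init__(self):
        self.kv = {}                                       # vertex_id -> properties
        self.kkv = defaultdict(lambda: defaultdict(list))  # src -> label -> [dst]
        self.inverted = defaultdict(set)                   # property value -> vertex ids

    def add_vertex(self, vid, **props):
        self.kv[vid] = props
        for value in props.values():
            self.inverted[str(value)].add(vid)

    def add_edge(self, src, label, dst):
        self.kkv[src][label].append(dst)

    def hop(self, frontier, label):
        """One traversal round: expand every frontier vertex along `label`."""
        nxt = []
        for vid in frontier:
            # .get avoids creating empty entries for unknown vertices.
            for dst in self.kkv.get(vid, {}).get(label, []):
                if dst not in nxt:
                    nxt.append(dst)
        return nxt


g = GraphIndex()
g.add_vertex("actor_a", type="person")
g.add_vertex("drama_1", type="drama")
g.add_vertex("director_d", type="person")
g.add_edge("actor_a", "acted_in", "drama_1")
g.add_edge("drama_1", "directed_by", "director_d")

people = sorted(g.inverted["person"])   # inverted lookup can seed a traversal
round1 = g.hop(["actor_a"], "acted_in")  # first round: actor -> dramas
round2 = g.hop(round1, "directed_by")    # second round: dramas -> directors
```

Keying edges by source and then by label means each traversal round is a pair of direct lookups per frontier vertex, which is the property that makes multi-round traversal cheap in this sketch.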

Future Outlook : As AI capabilities mature, richer knowledge bases and multimodal retrieval will further enhance user interaction across devices, presenting both challenges and opportunities for large‑scale video search.


