Multi‑Level Multi‑Modal Search Engine and Graph Engine for Video Content at Youku
The article presents a detailed technical overview of Youku's video search system, covering multi‑modal inputs, multi‑level element indexing, face search, cross‑level and cross‑modal retrieval, and the design and applications of a multimodal graph engine with knowledge‑graph integration.
The talk introduces Youku's video search business, which spans multiple client platforms (APP, Web, OTT, TV, mini‑programs) and adapts result layouts to user intent, supporting text, voice, and multimodal inputs such as images and video.
It explains the concept of multi‑level video element content, from macro levels (channels, programs) to micro levels (frames, objects, scenes, narratives), and how AI models extract these elements for recall.
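The multi-level hierarchy described above, from macro (channels, programs) down to micro (scenes, frames, objects), can be sketched as a minimal data model. All class, field, and identifier names here are illustrative assumptions, not Youku's actual schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical sketch of one node in the multi-level element hierarchy:
# macro levels (channel, program) down to micro levels (scene, frame, object).
@dataclass
class Element:
    level: str                      # e.g. "program", "scene", "frame"
    element_id: str
    parent_id: Optional[str] = None # link to the enclosing macro element
    labels: List[str] = field(default_factory=list)  # AI-extracted tags used for recall

# A tiny illustrative hierarchy: program -> scene -> frame.
program = Element("program", "prog:drama-001")
scene = Element("scene", "scene:ep01-07", parent_id=program.element_id)
frame = Element("frame", "frame:ep01-07-0420", parent_id=scene.element_id,
                labels=["dance", "stage"])

index = {e.element_id: e for e in (program, scene, frame)}

def ancestors(elem, index):
    """Walk parent links to recover the macro context of a micro element."""
    chain = []
    cur = elem
    while cur.parent_id:
        cur = index[cur.parent_id]
        chain.append(cur.element_id)
    return chain

print(ancestors(frame, index))  # ['scene:ep01-07', 'prog:drama-001']
```

Keeping parent links on every micro element lets a hit at the frame level be resolved back to its scene and program, which is what makes cross-level recall and merging possible.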
Face search is demonstrated as a multimodal input: users upload or capture photos, the system compresses and uploads the image, runs AUS services for face detection and vectorization, and retrieves matching video segments via vector and inverted‑index search.
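The recall step of that pipeline, matching the query face embedding against indexed segment embeddings, can be sketched with a toy cosine-similarity search. The segment IDs, vector dimensionality, and scores below are all illustrative; in production this would be an approximate-nearest-neighbor lookup in the vector index:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy "index": segment id -> face embedding produced offline by the face model.
segment_index = {
    "seg:ep03-112": [0.9, 0.1, 0.0, 0.2],
    "seg:ep07-045": [0.1, 0.8, 0.3, 0.0],
    "seg:ep01-230": [0.7, 0.4, 0.2, 0.1],
}

def face_search(query_vec, index, top_k=2):
    """Return the top-k segment ids by cosine similarity to the query face."""
    scored = sorted(index.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [seg for seg, _ in scored[:top_k]]

query = [0.88, 0.15, 0.05, 0.18]  # embedding of the uploaded photo
print(face_search(query, segment_index))  # ['seg:ep03-112', 'seg:ep01-230']
```

The inverted-index side mentioned above would then be intersected or unioned with these vector candidates before ranking.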
The multi‑level multimodal engine architecture is described, including a three‑tier index structure (vector, inverted, KV/KKV) and a DAG‑based execution framework that merges results across levels and performs cross‑modal retrieval between text and vectors.
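The DAG-based execution can be illustrated with Python's standard-library topological sorter: recall steps over the three index tiers run first, then attribute fill, merge, and ranking. The node names are hypothetical stand-ins for the engine's actual operators:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Illustrative retrieval plan: each key is a step, each value the set of
# steps it depends on. Step names are assumptions, not Youku's operators.
plan = {
    "vector_recall":   set(),                                 # ANN over embeddings
    "inverted_recall": set(),                                 # term/tag index lookup
    "kv_fill":         {"vector_recall", "inverted_recall"},  # KV/KKV attribute fill
    "level_merge":     {"kv_fill"},                           # merge across levels
    "rank":            {"level_merge"},                       # top-level ranking
}

# A valid execution order: the two recall steps can run in parallel,
# everything downstream waits on its dependencies.
order = list(TopologicalSorter(plan).static_order())
print(order)
```

In a real engine the independent recall nodes would execute concurrently; the topological order here just makes the dependency structure explicit.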
Key engine steps are multi‑level merge and top‑level ranking, enabling both consumer‑facing (C‑end) use cases (e.g., searching for a celebrity's dance clips) and business‑facing (B‑end) use cases (e.g., searching for specific shots or dialogue lines during content creation).

The presentation then shifts to the multimodal graph engine, outlining its role in building an entertainment‑industry knowledge graph, its use of Apache TinkerPop/Gremlin for graph traversal, and its integration of multimodal vector queries to answer complex questions such as “actors similar to Sun Li”.
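A query like “actors similar to Sun Li” combines a graph step with a vector step. The hedged sketch below uses a tiny in-memory graph instead of a real Gremlin traversal; apart from “Sun Li”, all names, programs, and embeddings are invented for illustration:

```python
# Toy knowledge graph: actor -> {embedding, programs they acted in}.
# In production the one-hop traversal below would be a Gremlin query
# against TinkerPop, and the scoring a vector-index lookup.
actors = {
    "Sun Li":  {"embedding": [0.9, 0.3], "acted_in": ["Drama A", "Drama B"]},
    "Actor X": {"embedding": [0.85, 0.35], "acted_in": ["Drama B"]},
    "Actor Y": {"embedding": [0.1, 0.9], "acted_in": ["Variety C"]},
}

def similar_actors(name, k=1):
    me = actors[name]
    my_programs = set(me["acted_in"])
    # Graph step: one-hop traversal actor -> program -> co-starring actors.
    costars = [n for n, v in actors.items()
               if n != name and my_programs & set(v["acted_in"])]
    # Vector step: rank the graph candidates by embedding dot product.
    qa, qb = me["embedding"]
    def score(other):
        a, b = actors[other]["embedding"]
        return qa * a + qb * b
    return sorted(costars, key=score, reverse=True)[:k]

print(similar_actors("Sun Li"))  # ['Actor X']
```

The point of the hybrid design is that the graph traversal narrows the candidate set structurally before the (more expensive) vector similarity ranks it.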
Graph engine indexing is based on distributed vertex/edge shards with KV, KKV, inverted, and vector indexes, supporting efficient graph traversal and model‑service steps that can invoke local or remote deep‑learning models during retrieval.
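The sharding scheme described above, where each shard of vertices carries its own local indexes, can be sketched as follows. The shard count, key scheme, and two-index layout (KV plus inverted) are simplifying assumptions:

```python
import hashlib

NUM_SHARDS = 4  # illustrative; real deployments size this to the cluster

def shard_of(vertex_id: str) -> int:
    """Route a vertex to a shard by hashing its id."""
    digest = hashlib.md5(vertex_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Each shard owns local indexes over its vertices (vector and KKV omitted).
shards = [{"kv": {}, "inverted": {}} for _ in range(NUM_SHARDS)]

def put_vertex(vertex_id, props):
    s = shards[shard_of(vertex_id)]
    s["kv"][vertex_id] = props                     # KV index: id -> properties
    for tag in props.get("tags", []):              # inverted index: tag -> ids
        s["inverted"].setdefault(tag, set()).add(vertex_id)

def get_vertex(vertex_id):
    # Point lookup touches exactly one shard.
    return shards[shard_of(vertex_id)]["kv"].get(vertex_id)

put_vertex("actor:sunli", {"name": "Sun Li", "tags": ["actor"]})
print(get_vertex("actor:sunli")["name"])  # Sun Li
```

Because routing is deterministic, point lookups hit a single shard, while tag queries over the inverted index fan out across shards and merge, mirroring the distributed traversal the talk describes.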
Finally, the summary reflects on the growing opportunities and challenges as AI enriches video search, emphasizing the need for robust multimodal retrieval, knowledge‑graph integration, and scalable graph‑based services to improve user interaction across devices.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.