Multi‑Level Multi‑Modal Search Engine and Graph Engine for Video Content at Youku
The article presents a detailed technical overview of Youku's video search system, covering multi‑modal inputs, multi‑level element indexing, face search, cross‑level and cross‑modal retrieval, and the design and applications of a multimodal graph engine with knowledge‑graph integration.
The talk introduces Youku's video search business, which spans multiple client platforms (APP, Web, OTT, TV, mini‑programs) and adapts result layouts to user intent, supporting text, voice, and multimodal inputs such as images and video.
It explains the concept of multi‑level video element content, from macro levels (channels, programs) to micro levels (frames, objects, scenes, narratives), and how AI models extract these elements for recall.
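The multi-level hierarchy described above, from macro (channels, programs) down to micro (scenes, frames, objects), can be sketched as a minimal data model. All class, field, and identifier names here are illustrative assumptions, not Youku's actual schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical sketch of one node in the multi-level element hierarchy:
# macro levels (channel, program) down to micro levels (scene, frame, object).
@dataclass
class Element:
    level: str                      # e.g. "program", "scene", "frame"
    element_id: str
    parent_id: Optional[str] = None # link to the enclosing macro element
    labels: List[str] = field(default_factory=list)  # AI-extracted tags used for recall

# A tiny illustrative hierarchy: program -> scene -> frame.
program = Element("program", "prog:drama-001")
scene = Element("scene", "scene:ep01-07", parent_id=program.element_id)
frame = Element("frame", "frame:ep01-07-0420", parent_id=scene.element_id,
                labels=["dance", "stage"])

index = {e.element_id: e for e in (program, scene, frame)}

def ancestors(elem, index):
    """Walk parent links to recover the macro context of a micro element."""
    chain = []
    cur = elem
    while cur.parent_id:
        cur = index[cur.parent_id]
        chain.append(cur.element_id)
    return chain

print(ancestors(frame, index))  # ['scene:ep01-07', 'prog:drama-001']
```

Keeping parent links on every micro element lets a hit at the frame level be resolved back to its scene and program, which is what makes cross-level recall and merging possible.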
Face search is demonstrated as a multimodal input: users upload or capture photos, the system compresses and uploads the image, runs AUS services for face detection and vectorization, and retrieves matching video segments via vector and inverted‑index search.
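The recall step of that pipeline, matching the query face embedding against indexed segment embeddings, can be sketched with a toy cosine-similarity search. The segment IDs, vector dimensionality, and scores below are all illustrative; in production this would be an approximate-nearest-neighbor lookup in the vector index:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy "index": segment id -> face embedding produced offline by the face model.
segment_index = {
    "seg:ep03-112": [0.9, 0.1, 0.0, 0.2],
    "seg:ep07-045": [0.1, 0.8, 0.3, 0.0],
    "seg:ep01-230": [0.7, 0.4, 0.2, 0.1],
}

def face_search(query_vec, index, top_k=2):
    """Return the top-k segment ids by cosine similarity to the query face."""
    scored = sorted(index.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [seg for seg, _ in scored[:top_k]]

query = [0.88, 0.15, 0.05, 0.18]  # embedding of the uploaded photo
print(face_search(query, segment_index))  # ['seg:ep03-112', 'seg:ep01-230']
```

The inverted-index side mentioned above would then be intersected or unioned with these vector candidates before ranking.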
The multi‑level multimodal engine architecture is described, including a three‑tier index structure (vector, inverted, KV/KKV) and a DAG‑based execution framework that merges results across levels and performs cross‑modal retrieval between text and vectors.
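The DAG-based execution can be illustrated with Python's standard-library topological sorter: recall steps over the three index tiers run first, then attribute fill, merge, and ranking. The node names are hypothetical stand-ins for the engine's actual operators:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Illustrative retrieval plan: each key is a step, each value the set of
# steps it depends on. Step names are assumptions, not Youku's operators.
plan = {
    "vector_recall":   set(),                                 # ANN over embeddings
    "inverted_recall": set(),                                 # term/tag index lookup
    "kv_fill":         {"vector_recall", "inverted_recall"},  # KV/KKV attribute fill
    "level_merge":     {"kv_fill"},                           # merge across levels
    "rank":            {"level_merge"},                       # top-level ranking
}

# A valid execution order: the two recall steps can run in parallel,
# everything downstream waits on its dependencies.
order = list(TopologicalSorter(plan).static_order())
print(order)
```

In a real engine the independent recall nodes would execute concurrently; the topological order here just makes the dependency structure explicit.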
Key engine steps are multi‑level merge and top‑level ranking, enabling both consumer‑facing (C‑end) use cases (e.g., searching for a celebrity's dance clips) and business‑facing (B‑end) use cases (e.g., searching for specific shots or dialogue lines during content creation).

The presentation then shifts to the multimodal graph engine, outlining its role in building an entertainment‑industry knowledge graph, its use of Apache TinkerPop/Gremlin for graph traversal, and its integration of multimodal vector queries to answer complex questions such as “actors similar to Sun Li”.
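A query like “actors similar to Sun Li” combines a graph step with a vector step. The hedged sketch below uses a tiny in-memory graph instead of a real Gremlin traversal; apart from “Sun Li”, all names, programs, and embeddings are invented for illustration:

```python
# Toy knowledge graph: actor -> {embedding, programs they acted in}.
# In production the one-hop traversal below would be a Gremlin query
# against TinkerPop, and the scoring a vector-index lookup.
actors = {
    "Sun Li":  {"embedding": [0.9, 0.3], "acted_in": ["Drama A", "Drama B"]},
    "Actor X": {"embedding": [0.85, 0.35], "acted_in": ["Drama B"]},
    "Actor Y": {"embedding": [0.1, 0.9], "acted_in": ["Variety C"]},
}

def similar_actors(name, k=1):
    me = actors[name]
    my_programs = set(me["acted_in"])
    # Graph step: one-hop traversal actor -> program -> co-starring actors.
    costars = [n for n, v in actors.items()
               if n != name and my_programs & set(v["acted_in"])]
    # Vector step: rank the graph candidates by embedding dot product.
    qa, qb = me["embedding"]
    def score(other):
        a, b = actors[other]["embedding"]
        return qa * a + qb * b
    return sorted(costars, key=score, reverse=True)[:k]

print(similar_actors("Sun Li"))  # ['Actor X']
```

The point of the hybrid design is that the graph traversal narrows the candidate set structurally before the (more expensive) vector similarity ranks it.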
Graph engine indexing is based on distributed vertex/edge shards with KV, KKV, inverted, and vector indexes, supporting efficient graph traversal and model‑service steps that can invoke local or remote deep‑learning models during retrieval.
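The sharding scheme described above, where each shard of vertices carries its own local indexes, can be sketched as follows. The shard count, key scheme, and two-index layout (KV plus inverted) are simplifying assumptions:

```python
import hashlib

NUM_SHARDS = 4  # illustrative; real deployments size this to the cluster

def shard_of(vertex_id: str) -> int:
    """Route a vertex to a shard by hashing its id."""
    digest = hashlib.md5(vertex_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Each shard owns local indexes over its vertices (vector and KKV omitted).
shards = [{"kv": {}, "inverted": {}} for _ in range(NUM_SHARDS)]

def put_vertex(vertex_id, props):
    s = shards[shard_of(vertex_id)]
    s["kv"][vertex_id] = props                     # KV index: id -> properties
    for tag in props.get("tags", []):              # inverted index: tag -> ids
        s["inverted"].setdefault(tag, set()).add(vertex_id)

def get_vertex(vertex_id):
    # Point lookup touches exactly one shard.
    return shards[shard_of(vertex_id)]["kv"].get(vertex_id)

put_vertex("actor:sunli", {"name": "Sun Li", "tags": ["actor"]})
print(get_vertex("actor:sunli")["name"])  # Sun Li
```

Because routing is deterministic, point lookups hit a single shard, while tag queries over the inverted index fan out across shards and merge, mirroring the distributed traversal the talk describes.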
Finally, the summary reflects on the growing opportunities and challenges as AI enriches video search, emphasizing the need for robust multimodal retrieval, knowledge‑graph integration, and scalable graph‑based services to improve user interaction across devices.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.