
How Youku’s Multi‑Modal Search Engine Powers Billion‑Scale Video Retrieval

This article details the design and implementation of Youku’s Multi‑Modal Search Engine (MMS), covering its distributed multi‑level indexing architecture, vector retrieval using Aitheta, cross‑modal query scheduling, graph‑based execution engine, and real‑world applications such as intelligent video search and image‑based series lookup.

Youku Technology

Background

With the rapid growth of smartphones and mobile internet, the amount and variety of multimodal data (text, images, audio, video) have exploded. Advances in computing, storage, and AI technologies enable richer experiences on cloud and edge devices, prompting the need for more capable multimedia search capabilities.

System Overview

Youku, as a massive video platform, stores huge amounts of OGC and UGC content, which includes high‑dimensional multimodal data: titles, descriptions, and comments (text), video frames (images), audio, and continuous video segments. Traditional inverted‑index search engines handle only text and cannot retrieve multimedia content. To address this, Youku designed and built a Multi‑Level Multi‑Modal Search Engine (MMS) that provides distributed large‑scale indexing, low‑latency cross‑modal retrieval, and multi‑level fusion, ranking, and sorting.

Key Technologies

1. Distributed Multi‑Level Multi‑Modal Index Structure

Each level builds an independent distributed index, supporting both inverted and vector indexes. For videos, frames, and faces, the index hierarchy includes:

Video meta‑text: limited fields (name, program info) for precise recall.

Video frame vectors: key‑frame embeddings plus meta‑text for frame‑level search.

Face vectors: recognized celebrity/person vectors stored in a separate retrieval library.

The hierarchy links video → frame → face, with vector indexes spanning billions of entries (e.g., 860 million frame vectors, 38 million face vectors) across ten shards, achieving >90% top‑10 recall.
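The linked hierarchy and shard routing can be sketched as follows. This is a toy in‑memory model under assumed names (`video_index`, `frame_index`, `face_index`, hash‑based shard routing); the real system's shard assignment and storage layout are not described in the article.

```python
import hashlib

NUM_SHARDS = 10  # the article cites ten shards for the vector indexes

def shard_for(doc_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Route a document to a shard by a stable hash of its id (illustrative)."""
    digest = hashlib.md5(doc_id.encode()).hexdigest()
    return int(digest, 16) % num_shards

# Toy multi-level index: video -> frame ids -> face ids.
video_index = {"video:42": {"title": "Demo Show", "frames": ["frame:42:0", "frame:42:1"]}}
frame_index = {"frame:42:0": {"faces": ["face:7"]}, "frame:42:1": {"faces": []}}
face_index = {"face:7": {"person": "some_celebrity"}}

def expand_video(video_id: str) -> dict:
    """Walk the video -> frame -> face hierarchy and collect the linked ids."""
    frames = video_index[video_id]["frames"]
    faces = [f for fid in frames for f in frame_index[fid]["faces"]]
    return {"frames": frames, "faces": faces}
```

Routing by a stable hash keeps a document on the same shard across rebuilds, which is one common choice; the production scheme may differ.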

2. Vector Retrieval

Both the frame and face indexes store vectors. After extensive benchmarking, the team selected Aitheta as the vector retrieval engine because it outperformed FAISS in latency and recall at this scale. Aitheta is integrated via an indexlib plugin, and the team further optimized it with dimensionality reduction and automated parameter tuning for billion‑scale vector sets.

On top of Aitheta, the search service uses Ha3's vector query capabilities, adding match‑score returns, multi‑vector query support, and source‑vector tagging for downstream business logic.
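Aitheta itself is not publicly documented, but the query‑side behavior described here (match scores returned with ids, multi‑vector queries) can be illustrated with a brute‑force inner‑product search; a real engine would use an approximate index instead:

```python
import numpy as np

def top_k(query: np.ndarray, index: np.ndarray, k: int = 10):
    """Return the k best (ids, match scores) by inner product.
    Brute force stands in for the ANN index here."""
    scores = index @ query            # one score per indexed vector
    order = np.argsort(-scores)[:k]   # highest scores first
    return order, scores[order]

def top_k_multi(queries: np.ndarray, index: np.ndarray, k: int = 10):
    """Multi-vector query: score each document by its best match
    across all query vectors."""
    scores = (index @ queries.T).max(axis=1)
    order = np.argsort(-scores)[:k]
    return order, scores[order]
```

Returning scores alongside ids is what lets downstream ranking stages fuse results from different levels and modalities.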

3. Retrieval Scheduling

MMS must orchestrate cross‑level and cross‑modal queries in real time. The system defines standard cross‑level and cross‑modal rules and assembles the online retrieval logic from the user's input. The workflow includes:

Cross‑level expansion: start from the user‑specified level and adaptively infer target levels.

Cross‑modal expansion: unify representations of different modalities into a common space for vector search, enabling text‑to‑vector and vector‑to‑text retrieval.
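The key idea behind cross‑modal expansion is that text and image encoders emit vectors in one shared space, so a single vector index serves both directions. The article does not name the encoders, so the sketch below uses deterministic stand‑ins (`encode_text`, `encode_image` are hypothetical):

```python
import hashlib
import numpy as np

DIM = 8  # toy dimensionality of the assumed shared embedding space

def _seeded_unit_vector(key: str) -> np.ndarray:
    """Deterministic stand-in for a trained encoder (illustration only)."""
    seed = int(hashlib.md5(key.encode()).hexdigest()[:8], 16)
    v = np.random.default_rng(seed).standard_normal(DIM)
    return v / np.linalg.norm(v)

def encode_text(text: str) -> np.ndarray:
    # A real system would run a trained text tower here.
    return _seeded_unit_vector("text:" + text)

def encode_image(image_id: str) -> np.ndarray:
    # A real system would run a trained image tower here.
    return _seeded_unit_vector("image:" + image_id)

# Because both encoders target the same space, text-to-vector and
# vector-to-text retrieval reuse the same index and similarity metric.
```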

The overall retrieval flow is shown in the diagram below.

Retrieval flow diagram

4. Graph‑Based Execution Engine

To meet complex retrieval logic and low‑latency requirements, MMS adopts the Suez graph execution engine, which combines a DAG executor with business‑logic operators. This engine abstracts operators, allowing flexible composition and reuse. It integrates with Alibaba Search AI·OS, leveraging TensorFlow‑style operators for data flow.
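The shape of such a DAG executor can be sketched with Python's standard‑library topological sorter; the operator names and shared‑context design below are assumptions, not Suez's actual API:

```python
from graphlib import TopologicalSorter

def run_graph(ops: dict, deps: dict, ctx: dict) -> dict:
    """Run operators in dependency order over a shared request context."""
    for name in TopologicalSorter(deps).static_order():
        ops[name](ctx)
    return ctx

# A four-stage search DAG: parse -> retrieve -> sort -> format.
ops = {
    "parse":    lambda c: c.update(query=c["raw"].strip().lower()),
    "retrieve": lambda c: c.update(hits=[h for h in c["docs"] if c["query"] in h]),
    "sort":     lambda c: c.update(hits=sorted(c["hits"])),
    "format":   lambda c: c.update(result={"hits": c["hits"]}),
}
deps = {"retrieve": {"parse"}, "sort": {"retrieve"}, "format": {"sort"}}
```

Expressing the pipeline as named operators with explicit dependencies is what makes stages reusable across query types, which matches the composition-and-reuse benefit claimed above.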

5. General‑Purpose Operators

The search pipeline implements generic operators on the graph engine:

Query parser: parses simple‑text or vector queries, supports combined text‑vector queries, and offers advanced syntax for coarse‑to‑fine ranking control.

Merge: fuses multi‑level documents, enriching them with forward‑index and summary information.

Sort: applies ranking logic and selects the top‑N results.

Result: formats output as JSON/XML or protobuf, with optional Snappy/LZ4 compression and debugging parameters for traceability.
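A minimal sketch of the first operator's job, parsing a combined text + vector query: the actual advanced syntax is not public, so the `field:value AND vector:[...]` grammar here is an invented placeholder.

```python
import json

def parse_query(raw: str) -> dict:
    """Split a combined query into text and vector clauses.
    The 'field:value AND vector:[...]' syntax is an assumption."""
    clauses = []
    for part in raw.split(" AND "):
        key, _, value = part.partition(":")
        if key == "vector":
            clauses.append({"type": "vector", "value": json.loads(value)})
        else:
            clauses.append({"type": "text", "field": key, "value": value})
    return {"clauses": clauses}
```

Downstream operators can then route vector clauses to the ANN index and text clauses to the inverted index before Merge fuses the two result sets.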

Application Scenarios

1. Youku Intelligent Search

MMS indexes videos, frames, and elements (people, actions) across multiple levels, enabling precise frame‑level playback and satisfying user demands for exact video snippets.

Intelligent search illustration

2. Image‑Based Series Search

Users can upload or capture images to search for characters, programs, or similar scenes. The system supports both face‑recognition‑based name lookup and direct image‑vector retrieval.
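The dispatch between those two recall paths might look like this; the function and index shapes are hypothetical, standing in for the face library and frame‑vector index described earlier:

```python
import numpy as np

def search_by_image(image_vec, face_vec, face_index, frame_index):
    """Two recall paths: face-recognition name lookup when a face was
    detected, otherwise direct image-vector similar-scene retrieval."""
    if face_vec is not None:
        names, vectors = face_index
        best = int(np.argmax(vectors @ face_vec))
        return "person", names[best]
    ids, vectors = frame_index
    best = int(np.argmax(vectors @ image_vec))
    return "scene", ids[best]
```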

Image‑based series search illustration

Conclusion & Outlook

Multimedia content continues to grow explosively with 5G, live streaming, and short‑video trends. As AI matures, multimodal understanding and representation will improve, making multimodal human‑computer interaction pervasive. Large‑scale, multi‑level, multimodal retrieval, as realized by Youku’s MMS, is a core capability for future intelligent applications across video distribution, creation, and broader interactive scenarios.

Tags: vector retrieval, large-scale indexing, video platform, graph execution engine, multimodal search