
Video Search Technology and Multi-modal Applications at Alibaba Youku

Alibaba's Youku video search platform uses a six-layer architecture (data extraction, technology integration, recall, relevance, ranking, and intent understanding). It draws on CV, NLP, knowledge graphs, and multi-modal cues such as face, OCR, and audio recognition to address title-mismatch, entity, and semantic matching challenges and to deliver precise, diverse video retrieval.

Youku Technology

Video search is a comprehensive application scenario involving information retrieval, natural language processing (NLP), machine learning, and computer vision (CV). With significant advancements in deep learning across these fields and the widespread user demand for video production and consumption, video search technology has developed rapidly in both academia and industry.

This article shares the video search technology and multi-modal applications from Alibaba's Youku, presented by Senior Algorithm Expert Ruo Ren at GMIC 2020.

Search System Framework:

The search algorithm framework consists of six layers, from bottom to top: the data layer, the technology layer, recall, relevance, ranking, and intent understanding.

1) Data Layer: Extracts knowledge from video content, including entities, the relationships between entities, and their attributes. Uses knowledge graphs to organize content along the timeliness dimension.

2) Technology Layer: Uses CV and NLP technologies to support content recall, relevance, ranking, and Query intent understanding.

3) Recall Layer: Focuses on multi-media content understanding.

4) Relevance: Includes basic relevance and semantic matching technologies.

5) Ranking Layer: Uses learning-to-rank methods to improve distribution effectiveness while also optimizing experience targets such as timeliness and diversity.

6) Intent: Performs component analysis on the query to identify what each component represents (program name, series information, etc.) and establishes a fine-grained intent system.
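
To make the query component analysis in layer 6 concrete, here is a minimal sketch in Python. It assumes a tiny hand-written lexicon and two regular expressions; the names PROGRAM_NAMES, PERSON_NAMES, and analyze_query are hypothetical and are not taken from Youku's system, which the talk describes only at a high level.

```python
import re

# Hypothetical lexicons for illustration; a production system would more
# likely draw these from a knowledge graph than from hard-coded sets.
PROGRAM_NAMES = {"nirvana in fire", "street dance of china"}
PERSON_NAMES = {"hu ge"}

# Patterns for series information such as "episode 12" or "season 2".
EPISODE_PATTERN = re.compile(r"\b(?:episode|ep)\s*(\d+)\b")
SEASON_PATTERN = re.compile(r"\b(?:season|s)\s*(\d+)\b")


def analyze_query(query):
    """Split a query into (label, value) components."""
    text = query.lower().strip()
    components = []

    # 1. Pull out series information with patterns.
    for label, pattern in (("EPISODE", EPISODE_PATTERN), ("SEASON", SEASON_PATTERN)):
        match = pattern.search(text)
        if match:
            components.append((label, int(match.group(1))))
            text = text[:match.start()] + " " + text[match.end():]

    # 2. Look up known names, longest first (plain substring matching,
    #    which is good enough for a sketch but not for production).
    for label, lexicon in (("PROGRAM", PROGRAM_NAMES), ("PERSON", PERSON_NAMES)):
        for name in sorted(lexicon, key=len, reverse=True):
            if name in text:
                components.append((label, name))
                text = text.replace(name, " ", 1)

    # 3. Whatever is left is kept as an unlabeled remainder.
    rest = " ".join(text.split())
    if rest:
        components.append(("OTHER", rest))
    return components


print(analyze_query("Nirvana in Fire episode 12"))
# [('EPISODE', 12), ('PROGRAM', 'nirvana in fire')]
```

Labeled components like these could then be routed to different recall and ranking strategies, for example an exact episode lookup versus a broader person search.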

Challenges in Video Search:

1) Content Relevance Matching: User queries may not directly match video titles. Content understanding and analysis are needed to enrich metadata and establish content relevance.

2) Entity Knowledge Matching: Video titles are given a structured interpretation using named entity recognition (NER), and CV technology is used to improve NER accuracy.

3) Semantic Matching: For semantic/How-to type knowledge matching, comprehensive analysis using content understanding and entity knowledge supplementation is required.
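
One way to picture CV assisting NER (challenge 2) is a simple score fusion: entity candidates found in the title are boosted when the same entity is also detected in the video frames. The sketch below is an assumption made for illustration, not the method described in the talk. The function name rerank_entities, the weight visual_weight, and the scores are all hypothetical.

```python
def rerank_entities(ner_scores, visual_scores, visual_weight=0.4):
    """Blend title-NER confidence with visual confidence per candidate entity.

    ner_scores:    {entity: confidence from text NER on the title}
    visual_scores: {entity: confidence from e.g. face recognition on frames}
    Candidates with no visual evidence get a visual score of 0.0.
    """
    fused = {}
    for entity, text_score in ner_scores.items():
        visual = visual_scores.get(entity, 0.0)
        fused[entity] = (1 - visual_weight) * text_score + visual_weight * visual
    # Highest fused confidence first.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# The title alone is ambiguous between two candidates; the frames support one.
title_candidates = {"Actor A": 0.50, "Actor B": 0.55}
frame_detections = {"Actor A": 0.90}
ranked = rerank_entities(title_candidates, frame_detections)
print(ranked[0][0])  # Actor A
```

A linear blend is only the simplest possible choice; it is used here to show the direction of the idea, namely that visual evidence can resolve a title that text alone leaves ambiguous.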

Multi-modal Video Search Practice:

Traditional text-based retrieval faces several difficulties: information is lost when only one modality is used (UGC video titles are often brief and unclear), user search intents are diverse, and business-side (ToB) content creators need to find a wide variety of video clips and materials.

Youku's multi-modal search approach:

1) Uses CV algorithms to convert information from other modalities into text

2) Implements multi-modal content retrieval for recall

3) Uses content relevance and ranking technology to meet user retrieval needs
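
The first two steps above can be sketched as follows: once each modality has been reduced to text, all of it can be placed in one inverted index and recalled with ordinary term matching. This is a minimal illustration under stated assumptions. The field names (title, ocr, asr, face), the sample data, and the functions build_index and recall are hypothetical, and the CV models that would produce the text are not shown.

```python
from collections import defaultdict


def build_index(videos):
    """Build an inverted index: token -> set of (video_id, modality)."""
    index = defaultdict(set)
    for video_id, fields in videos.items():
        for modality, text in fields.items():
            for token in text.lower().split():
                index[token].add((video_id, modality))
    return index


def recall(index, query):
    """Return videos matching every query token in any modality,
    mapped to the sorted list of modalities that contributed a match."""
    tokens = query.lower().split()
    if not tokens:
        return {}
    per_token = [{vid for vid, _ in index.get(token, set())} for token in tokens]
    matched = set.intersection(*per_token)
    return {
        vid: sorted({m for t in tokens for v, m in index[t] if v == vid})
        for vid in matched
    }


# Hypothetical videos whose non-text modalities were already converted to text.
videos = {
    "v1": {"title": "funny clip", "ocr": "happy birthday",
           "asr": "make a wish", "face": "actor_a"},
    "v2": {"title": "birthday cake recipe", "asr": "mix the flour"},
}
index = build_index(videos)
print(recall(index, "birthday wish"))  # {'v1': ['asr', 'ocr']}
```

Here the query matches v1 even though neither query word appears in its title, which is the situation described for brief UGC titles. Step 3, relevance and ranking, would then order the recalled candidates.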

Techniques include face recognition to identify celebrities, OCR/ASR for dialogue-to-text conversion, keyword extraction, music recognition, action recognition, scene recognition, and emotion recognition. Knowledge graphs provide entity coverage across industries to extract core content themes, while coreference reasoning capabilities help understand relationships between entities and content.
