Video Search at Youku: Algorithmic Practices, Relevance, Ranking, and Multimodal Techniques
This article presents a comprehensive overview of Youku's video search system, covering business background, evaluation metrics, system and algorithm frameworks, relevance and ranking feature engineering, dataset construction, semantic matching, multimodal video understanding, and practical case studies that illustrate the impact of deep learning and AI techniques on search performance.
Introduction
Video search combines information retrieval, natural language processing, machine learning, and computer vision; rapid advances in deep learning and growing user demand have driven progress in both academia and industry. The article uses Youku as a case study to share algorithmic practice.
1. Business Background
Youku provides a unified search service for Alibaba Entertainment, covering apps, OTT, ticketing platforms, and content types such as movies, series, comics, user‑generated videos, talent, performances, and news.
2. Evaluation Metrics
Search quality is measured along two main dimensions: tool attributes (relevance, freshness, diversity, playability) and distribution attributes (view count, watch time, commercial value). Metrics include bounce rate, relevance scores, and multi‑objective targets.
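The bounce rate mentioned above can be sketched as the fraction of search sessions that end without any click. This is an illustrative toy computation; the session schema and field names are assumptions, not Youku's actual log format.

```python
def bounce_rate(sessions):
    """Fraction of search sessions that ended with no click on any result."""
    if not sessions:
        return 0.0
    bounced = sum(1 for s in sessions if not s["clicks"])
    return bounced / len(sessions)

# Toy session log: two of four sessions end without a click.
sessions = [
    {"query": "movie a", "clicks": ["v1"]},
    {"query": "movie b", "clicks": []},
    {"query": "series c", "clicks": ["v2", "v3"]},
    {"query": "news d", "clicks": []},
]
print(bounce_rate(sessions))  # 0.5
```

In practice the metric is sliced per query segment and time window, which this sketch omits.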
3. System Framework
The architecture consists of a search gateway, a query‑understanding (QP) service, an engine layer (coarse and fine ranking), offline indexing, and a machine‑learning platform for feature streaming and online model updates.
4. Algorithm Framework
Relevance is handled in the fine‑ranking stage, while the ranking service integrates prediction models, model fusion, and business policies to balance user experience and efficiency.
5. Relevance Feature Stack
Four feature levels are described: basic term‑weight features, knowledge‑enhanced features (NER, entity linking), posterior features (click‑through models such as UBM/DBN), and semantic features (sentence‑level embeddings, BERT, SMT, query rewriting).
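The posterior features above rest on the examination hypothesis shared by click models such as UBM and DBN: a click requires that the user both examined the position and found the result relevant, so raw click‑through rate must have position bias divided out. A minimal sketch, with made‑up examination probabilities:

```python
def debiased_relevance(clicks, impressions, exam_prob):
    """Posterior relevance estimate with position bias divided out.

    clicks[i], impressions[i]: counts observed at result position i.
    exam_prob[i]: assumed probability a user examines position i.
    Under the examination hypothesis, E[clicks] = relevance * sum of
    examined impressions, so dividing recovers relevance.
    """
    expected_exams = sum(n * e for n, e in zip(impressions, exam_prob))
    return sum(clicks) / expected_exams if expected_exams else 0.0

# The same result shown 100 times at position 1 and 100 times at
# position 2, where position 2 is examined only half the time.
rel = debiased_relevance(clicks=[30, 5], impressions=[100, 100],
                         exam_prob=[1.0, 0.5])
print(rel)
```

Full UBM/DBN models estimate the examination probabilities jointly from the logs (e.g., via EM) rather than assuming them as done here.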
6. Dataset Construction & Feature System
A multi‑year, multi‑label relevance dataset is built using active‑learning‑driven annotation, with both absolute relevance grades and pairwise preferences. Separate training and validation sets target specific business problems.
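One common way to drive annotation by active learning is uncertainty sampling: send annotators the query–video pairs the current model is least sure about. The talk does not specify which selection criterion Youku uses, so the following is a generic sketch:

```python
def select_for_annotation(pairs, scores, k):
    """Uncertainty sampling: pick the k query-video pairs whose model
    score is closest to the 0.5 decision boundary, i.e. where the
    current relevance model is least certain."""
    ranked = sorted(zip(pairs, scores), key=lambda ps: abs(ps[1] - 0.5))
    return [p for p, _ in ranked[:k]]

# Invented query-video candidates with current model scores.
pairs = [("q1", "v1"), ("q1", "v2"), ("q2", "v3"), ("q2", "v4")]
scores = [0.95, 0.52, 0.10, 0.47]
print(select_for_annotation(pairs, scores, 2))
```

Confidently scored pairs (0.95, 0.10) are skipped; annotation budget goes where a label changes the model most.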
7. Semantic Matching
Sentence‑level semantic models (e.g., DSSM, BERT‑based) are trained on both internal and external corpora; negative samples are carefully selected to reflect real‑world click behavior.
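A standard training recipe for two‑tower matchers in the DSSM family is in‑batch negative sampling: each query's positive document is scored against the other documents in the batch, which act as negatives. A minimal sketch with toy embeddings (the temperature and vectors are illustrative values, not Youku's configuration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def in_batch_loss(q_emb, doc_embs, pos_idx, temperature=0.1):
    """Softmax cross-entropy where the other docs in the batch serve
    as negatives for the query."""
    scores = [cosine(q_emb, d) / temperature for d in doc_embs]
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    return -math.log(exps[pos_idx] / sum(exps))

q = [1.0, 0.0]
docs = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.2]]  # docs[0] is the clicked positive
loss = in_batch_loss(q, docs, pos_idx=0)
```

The section's point about negative selection matters here: random in‑batch negatives are often too easy, so production systems mix in hard negatives mined from non‑clicked impressions.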
8. Ranking Feature System
Features include domain‑specific signals (real‑time promotion status, video playability), quality assessments (contrast, brightness, distortion), and user behavior statistics, all combined in a deep LTR model.
9. Multi‑Objective Deep LTR
A multi‑task loss combines relevance, ranking, and entity objectives, with sample weighting that balances video length and completion rate to avoid bias toward short or long videos.
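The length/completion balancing can be illustrated as follows. Completion rate alone favors short clips (easy to finish), while raw watch time favors long videos; blending the two is one way to keep either extreme from dominating the loss. This is a hedged illustration of the idea, not Youku's actual weighting formula:

```python
import math

def sample_weight(watch_seconds, video_seconds, alpha=0.5):
    """Blend completion rate with normalized log watch time so that
    neither short nor long videos systematically get higher weight.
    alpha is an assumed mixing coefficient."""
    completion = min(watch_seconds / video_seconds, 1.0)
    log_watch = (math.log1p(min(watch_seconds, video_seconds))
                 / math.log1p(video_seconds))
    return alpha * completion + (1 - alpha) * log_watch

short_done = sample_weight(30, 30)      # 30 s clip, fully watched
long_partial = sample_weight(30, 3600)  # feature film, 30 s watched
print(short_done, long_partial)
```

Under pure watch time the two samples above would be tied; under pure completion the film would be nearly ignored. The blend keeps a meaningful gradient for both.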
10. Multimodal Video Search
Multimodal search fuses audio, text, and visual cues. Techniques include OCR for on‑screen text recognition and ASR for speech transcription, CV for object/scene/person detection, and shot/key‑frame extraction to build fine‑grained video elements.
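Shot extraction is often bootstrapped by comparing color histograms of consecutive frames and flagging a cut when the distance spikes. The following is a toy sketch of that idea; production systems typically use learned detectors, and the histograms below are hand‑made stand‑ins for decoded frames:

```python
def shot_boundaries(frame_hists, threshold=0.5):
    """Return frame indices where a hard cut is detected: positions
    where the (halved) L1 distance between consecutive normalized
    color histograms exceeds the threshold."""
    cuts = []
    for i in range(1, len(frame_hists)):
        dist = sum(abs(a - b)
                   for a, b in zip(frame_hists[i - 1], frame_hists[i])) / 2
        if dist > threshold:
            cuts.append(i)
    return cuts

hists = [
    [0.8, 0.2],  # shot 1
    [0.7, 0.3],  # gradual change, same shot
    [0.1, 0.9],  # hard cut to shot 2
]
print(shot_boundaries(hists))  # [2]
```

Key frames can then be sampled per detected shot and fed to the OCR/CV models described above.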
11. Video‑Level Knowledge Graph
Recognized entities (people, places, items) are linked to a knowledge graph, enabling precise segment‑level retrieval (e.g., searching for a specific actor's appearance).
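Segment‑level retrieval over linked entities can be sketched as an inverted index from entity id to (video, start, end) postings. The entity and video identifiers below are invented examples, and a real system would add ranking and time‑range merging on top:

```python
from collections import defaultdict

class SegmentIndex:
    """Minimal entity-to-segment inverted index."""

    def __init__(self):
        self._postings = defaultdict(list)

    def add(self, entity_id, video_id, start_s, end_s):
        """Record that an entity appears in [start_s, end_s] of a video."""
        self._postings[entity_id].append((video_id, start_s, end_s))

    def search(self, entity_id):
        """All segments where the entity was recognized."""
        return list(self._postings.get(entity_id, []))

idx = SegmentIndex()
idx.add("actor:zhang_san", "video_42", 120.0, 185.5)
idx.add("actor:zhang_san", "video_77", 0.0, 14.0)
print(idx.search("actor:zhang_san"))
```

A query like "scenes with actor X" then resolves X through entity linking and jumps straight to the matching time ranges.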
12. Effect Cases
Real‑world examples demonstrate large lifts in recall and user engagement when multimodal cues and knowledge‑graph enrichment are applied.
Conclusion
The presentation highlights the end‑to‑end pipeline from query understanding to multimodal representation, emphasizing the need for continued research on story‑level video comprehension.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.