Deep Semantic Relevance and Multimodal Video Search at Alibaba Entertainment
The presentation by Alibaba Entertainment's senior algorithm expert details the challenges of video search in the 4G/5G era and describes a comprehensive framework covering business overview, relevance and ranking, multimodal retrieval, deep semantic modeling, dataset construction, and practical deployment techniques.
Presenter Ru Chen, a senior algorithm expert at Alibaba Entertainment, organizes the talk around four main topics: business overview, relevance and ranking, multimodal video search, and deep semantic relevance.
Business Overview: Alibaba Entertainment offers a one‑stop search and recommendation service across platforms such as Youku, OTT, PC, and apps, covering both copyrighted OGC videos and massive UGC content, as well as related media like performances and novels, serving billions of video assets.
User Value & Evaluation Metrics: Two dimensions are highlighted: the tool attribute (accurate and complete retrieval), measured by experience metrics such as bounce rate, relevance, timeliness, diversity, and playability; and the distribution attribute (view count and watch time), which drives revenue.
Search Algorithm Framework: The architecture consists of a data layer (content extraction and knowledge‑graph aggregation), a basic‑technology layer (CV and NLP), an intent layer (query tagging and fine‑grained intent understanding), a recall layer (multimedia understanding to bridge the semantic gap), a relevance layer (semantic matching and deep semantic computation), and a ranking layer (multi‑objective LTR optimizing both experience and distribution).
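The layered flow above can be sketched end to end. Everything here is illustrative: the intent tagging, keyword recall, overlap-based relevance, and blended-objective weights are toy stand-ins for the real layers, not Alibaba's actual system.

```python
# Toy sketch of the layered search flow: intent -> recall -> relevance -> ranking.
# All logic and weights are illustrative assumptions.

def understand_intent(query):
    """Intent layer stand-in: tag the query with a coarse content type."""
    return {"query": query, "intent": "ogc" if query.istitle() else "ugc"}

def recall(intent, index):
    """Recall layer stand-in: keyword match over a small in-memory index."""
    terms = intent["query"].lower().split()
    return [doc for doc in index if any(t in doc["title"].lower() for t in terms)]

def relevance(intent, doc):
    """Relevance layer stand-in: term-overlap ratio between query and title."""
    q = set(intent["query"].lower().split())
    d = set(doc["title"].lower().split())
    return len(q & d) / max(len(q), 1)

def rank(intent, candidates, w_exp=0.7, w_dist=0.3):
    """Ranking layer stand-in: blend relevance (experience) with views (distribution)."""
    max_views = max((d["views"] for d in candidates), default=1)
    scored = [(w_exp * relevance(intent, d) + w_dist * d["views"] / max_views, d)
              for d in candidates]
    return [d for _, d in sorted(scored, key=lambda x: -x[0])]

index = [{"title": "funny cat compilation", "views": 900},
         {"title": "cat documentary", "views": 100}]
intent = understand_intent("cat video")
results = rank(intent, recall(intent, index))
```

With equal relevance scores, the distribution term breaks the tie in favor of the higher-view video, mirroring how the multi-objective LTR balances the two goals.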
Relevance Challenges: The system must handle heterogeneous content understanding, entity knowledge matching, and deep semantic computation, which are more demanding for video than for text.
Relevance Feature Sets: These include basic textual features, knowledge features derived from content understanding and knowledge graphs, posterior features from click‑through logs, and semantic features from models such as DSSM and BERT.
Relevance Dataset Construction: Relevance data are built via crowdsourced labeling with tiered relevance levels and partial‑order annotations; a validation set targets specific online issues, while a training set drives iterative model improvement. Cost‑effective sample discovery uses a Q‑learning‑based approach.
Ranking Feature System: Combines query, document, and match features with platform‑specific signals such as real‑time playback control to address both experience and distribution goals.
Multimodal Video Search: Because video titles are often short, a three‑stage pipeline is employed: (1) CV techniques convert visual/audio signals to textual representations (OCR, face/scene recognition, keyword extraction); (2) multimodal recall retrieves candidates; (3) multimodal relevance ranking refines results.
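Stage (1) can be sketched as enriching a short title with CV-derived text fields. The extraction functions below are stubs standing in for real OCR, face, and scene models, and the field weights are illustrative assumptions, not values from the talk.

```python
# Hypothetical sketch of stage (1): fuse CV-derived text (OCR, faces, scenes)
# with a short title into one field-weighted text document for recall/ranking.

def extract_ocr(frames):        return ["season finale"]   # stub for an OCR model
def recognize_faces(frames):    return ["Actor A"]         # stub for face recognition
def classify_scenes(frames):    return ["courtroom"]       # stub for scene classification

def enrich(video):
    fields = {
        "title":  video["title"],
        "ocr":    " ".join(extract_ocr(video["frames"])),
        "faces":  " ".join(recognize_faces(video["frames"])),
        "scenes": " ".join(classify_scenes(video["frames"])),
    }
    # Repeat each field by an assumed importance weight, then flatten to text.
    weights = {"title": 3, "ocr": 2, "faces": 2, "scenes": 1}
    return " ".join(" ".join([text] * weights[f]) for f, text in fields.items())

doc = enrich({"title": "ep 12", "frames": []})
```

The enriched text lets a query like the actor's name match a video whose title alone ("ep 12") contains no retrievable terms, which is the gap this stage addresses.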
Deep Semantic Relevance Framework: A three‑stage model is used: Transfer (pre‑train a general BERT on search logs to obtain a domain‑specific semantic model), Adapt (multi‑task fine‑tuning for query analysis, recall, and ranking), and Distill (multi‑stage knowledge distillation with both unlabeled and labeled data).
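The Distill stage's use of both unlabeled and labeled data can be written as one blended loss: on unlabeled pairs the student matches the teacher's soft scores, on labeled pairs it fits the human label. This is a minimal sketch of that standard formulation, with a plain logistic scorer and an assumed mixing weight `alpha`; it is not the exact loss from the talk.

```python
# Minimal sketch of a distillation loss mixing soft (teacher) and hard (label)
# cross-entropy terms; `alpha` is an assumed mixing weight.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def distill_loss(student_logit, teacher_prob=None, label=None, alpha=0.5):
    """Cross-entropy vs. the teacher's soft score, plus cross-entropy vs. the label."""
    p = sigmoid(student_logit)
    loss = 0.0
    if teacher_prob is not None:   # unlabeled data: mimic the teacher
        loss += alpha * -(teacher_prob * math.log(p)
                          + (1 - teacher_prob) * math.log(1 - p))
    if label is not None:          # labeled data: fit ground truth
        loss += (1 - alpha) * -(label * math.log(p)
                                + (1 - label) * math.log(1 - p))
    return loss
```

A student logit that agrees with a confident teacher and the label incurs a small loss; one that disagrees incurs a large one, which is what pushes the small model toward the large model's behavior.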
Model Selection & Deployment: Symmetric dual‑tower models are accurate but costly; an asymmetric dual‑tower design stores multiple document embeddings offline and uses a lightweight three‑layer BERT for queries online, with attention‑based scoring and multi‑stage distillation to narrow the performance gap.
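The asymmetric scoring step can be sketched with NumPy: several document embeddings are precomputed offline, the query is encoded online by a light model, and attention over the stored document vectors produces the score. The query encoder here is a deterministic random stand-in for the three-layer BERT, and the single-query attention formulation is an assumption.

```python
# Hypothetical sketch of asymmetric dual-tower scoring: k offline doc
# embeddings, one online query vector, attention-based pooling for the score.
import numpy as np

def encode_query(q_tokens, dim=8):
    """Stand-in for the lightweight online query encoder (e.g. a 3-layer BERT)."""
    seed = sum(ord(c) for c in " ".join(q_tokens))  # deterministic toy embedding
    return np.random.default_rng(seed).standard_normal(dim)

def attention_score(q_vec, doc_vecs):
    """Softmax-attend over the k stored doc embeddings, then dot with the query."""
    logits = doc_vecs @ q_vec                 # (k,) query-doc affinities
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                  # softmax over the k embeddings
    pooled = weights @ doc_vecs               # (dim,) attention-pooled doc vector
    return float(pooled @ q_vec)

doc_vecs = np.random.default_rng(0).standard_normal((4, 8))  # "offline" embeddings
score = attention_score(encode_query(["cat", "video"]), doc_vecs)
```

The online cost is one small query encoding plus a k-by-dim attention, which is why this design can approach symmetric-model quality at a fraction of the serving cost.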
Knowledge‑Enhanced Semantic Matching: To handle queries that require both knowledge and semantics, KG sub‑graph embeddings are serialized and combined with text embeddings via attention mechanisms, enabling effective matching for complex queries.
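The serialization step can be sketched as linearizing an entity's sub-graph into text that a semantic model can consume alongside the query. The triple format and `[SEP]`-style joining are illustrative assumptions, not the exact serialization described in the talk.

```python
# Hedged sketch of serializing a KG sub-graph into text for joint encoding
# with the query; triples and joining convention are illustrative.

def serialize_subgraph(entity, triples):
    """Linearize the 1-hop sub-graph around `entity` as 'head relation tail' spans."""
    spans = [f"{h} {r} {t}" for h, r, t in triples if h == entity or t == entity]
    return " [SEP] ".join(spans)

def knowledge_augment(query, entity, triples):
    """Concatenate the query with its serialized sub-graph before encoding."""
    return f"{query} [SEP] {serialize_subgraph(entity, triples)}"

triples = [("MovieX", "directed_by", "DirectorY"),
           ("MovieX", "stars", "ActorZ"),
           ("MovieQ", "stars", "ActorW")]
augmented = knowledge_augment("movies by DirectorY", "MovieX", triples)
```

Feeding the augmented string to the semantic model lets knowledge facts (director, cast) participate in matching even when they never appear in the video's own text.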
Effectiveness Cases: Real‑world examples demonstrate improved ranking order and the ability to retrieve videos where query terms are absent from titles, confirming the benefits of the new relevance system.
The session concludes with thanks and an invitation to join the DataFunTalk community for further discussion.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.