Artificial Intelligence 11 min read

Learning Pixel-Level Distinctions for Video Highlight Detection

The Alibaba Mom Creative & Video Platform team introduces PLD‑VHD, a pixel‑level distinction learning framework that uses a 3D CNN encoder‑decoder with temporal and saliency modules to detect highlights, achieving state‑of‑the‑art results on public benchmarks and a 4,724‑video e‑commerce dataset, and boosting ad revenue through precise clipping and cropping.

Alimama Tech

Jun 22, 2022

Learning Pixel-Level Distinctions for Video Highlight Detection

This article presents the Alibaba Mom Creative & Video Platform team's research on video highlight detection, a method accepted at CVPR 2022. The work addresses the need for short (3‑10 s), silent videos in e‑commerce search and recommendation streams, where only the most attractive moments can capture user attention.

Existing video highlight detection (VHD) approaches rely on segment‑level content and achieve strong results on public datasets such as YouTube Highlight, TVSum, and CoSum. They are typically supervised (e.g., Video2GIF, LSVM) or weakly supervised (e.g., LIM) and treat each video fragment independently, ignoring fine‑grained temporal context and auxiliary signals like eye‑tracking.

The proposed PLD‑VHD method introduces pixel‑level distinction learning. An encoder‑decoder network processes a sliding window of 32 frames with a 3D CNN to capture temporal context. A temporal module fuses frame features, while an auxiliary spatial module generates pseudo‑labels via video saliency detection, leveraging eye‑movement data as a strong cue for highlight moments.

Experiments on the three public benchmarks demonstrate state‑of‑the‑art performance, with quantitative results surpassing previous methods. Visualizations show precise pixel‑wise highlight maps.

To suit e‑commerce scenarios, the team built a proprietary dataset of 4,724 product videos (15‑60 s each) annotated by multiple raters. Frames were scored based on consensus, producing binary or regression labels for training.

Applying PLD‑VHD to ad creative pipelines enables simultaneous temporal clipping and spatial cropping, because the predicted highlight map guides both segment selection and region‑of‑interest cropping. Deployment in Alibaba’s search, display, and content ads yielded noticeable online revenue gains.

In summary, PLD‑VHD offers a fine‑grained, interpretable solution for video highlight detection and sets the stage for future personalized models.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Encoder-Decoder eye tracking pixel-level distinction video highlight detection

Written by

Alimama Tech

Official Alimama tech channel, showcasing all of Alimama's technical innovations.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.