Learning Pixel-Level Distinctions for Video Highlight Detection
The Alimama Creative & Video Platform team introduces PLD‑VHD, a pixel‑level distinction learning framework that uses a 3D CNN encoder‑decoder with temporal and saliency modules to detect highlights. It achieves state‑of‑the‑art results on public benchmarks and a 4,724‑video e‑commerce dataset, and boosts ad revenue through precise clipping and cropping.
This article presents the Alimama Creative & Video Platform team's research on video highlight detection, which was accepted at CVPR 2022. The work addresses the need for short (3‑10 s), silent videos in e‑commerce search and recommendation feeds, where only the most attractive moments capture user attention.
Existing video highlight detection (VHD) approaches operate on segment‑level content and achieve strong results on public datasets such as YouTube Highlights, TVSum, and CoSum. They are typically supervised (e.g., Video2GIF, LSVM) or weakly supervised (e.g., LIM). Because they treat each video segment independently, they ignore fine‑grained temporal context and auxiliary signals such as eye‑tracking data.
The proposed PLD‑VHD method introduces pixel‑level distinction learning. An encoder‑decoder network processes a sliding window of 32 frames with a 3D CNN to capture temporal context. A temporal module fuses frame features, while an auxiliary spatial module generates pseudo‑labels via video saliency detection, leveraging eye‑movement data as a strong cue for highlight moments.
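The 32‑frame sliding window described above can be illustrated with a small indexing sketch. It only shows how a video might be split into overlapping windows before each one is passed to the encoder; the stride of 8 frames and the handling of the final window are assumptions for illustration, not details given in the article, and `sliding_windows` is a hypothetical helper name.

```python
def sliding_windows(num_frames, window=32, stride=8):
    """Return (start, end) frame-index pairs covering a video.

    window=32 follows the article; stride=8 is an assumed value.
    `end` is exclusive, so frames[start:end] selects one window.
    """
    if num_frames <= 0:
        return []
    if num_frames <= window:
        # Video shorter than one window: use it whole.
        return [(0, num_frames)]
    starts = list(range(0, num_frames - window + 1, stride))
    # Add a final window flush with the end so trailing frames are covered.
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)
    return [(s, s + window) for s in starts]


windows = sliding_windows(100)
print(len(windows), windows[0], windows[-1])
```

Overlapping windows give each frame several chances to be seen with different temporal context; how the per‑window predictions are merged is left open here.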
Experiments on the three public benchmarks show that PLD‑VHD outperforms previous methods, and visualizations show that its pixel‑wise highlight maps localize highlights precisely.
To suit e‑commerce scenarios, the team built a proprietary dataset of 4,724 product videos (15‑60 s each) annotated by multiple raters. Frames were scored based on consensus, producing binary or regression labels for training.
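One plausible way to turn multiple raters' annotations into per‑frame labels is sketched below. It assumes each rater marks every frame as highlight (1) or not (0), takes the fraction of raters who agree as the regression label, and thresholds that fraction to get the binary label. The voting scheme, the 0.5 threshold, and the name `consensus_labels` are assumptions; the article does not specify how consensus was computed.

```python
def consensus_labels(rater_marks, threshold=0.5):
    """Aggregate per-frame 0/1 marks from several raters.

    rater_marks: list of equal-length lists, one list per rater.
    Returns (regression, binary): the fraction of raters marking each
    frame as a highlight, and that fraction thresholded to 0/1.
    The 0.5 threshold is an assumed value.
    """
    if not rater_marks:
        raise ValueError("need at least one rater")
    num_frames = len(rater_marks[0])
    if any(len(marks) != num_frames for marks in rater_marks):
        raise ValueError("all raters must annotate the same frames")
    num_raters = len(rater_marks)
    regression = [
        sum(marks[i] for marks in rater_marks) / num_raters
        for i in range(num_frames)
    ]
    binary = [1 if score >= threshold else 0 for score in regression]
    return regression, binary


marks = [
    [1, 1, 0, 0],  # rater A
    [1, 0, 0, 1],  # rater B
    [1, 1, 0, 0],  # rater C
]
regression, binary = consensus_labels(marks)
print(binary)
```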
Applying PLD‑VHD to ad creative pipelines enables simultaneous temporal clipping and spatial cropping, because the predicted highlight map guides both segment selection and region‑of‑interest cropping. Deployment in Alibaba’s search, display, and content ads yielded noticeable online revenue gains.
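A minimal sketch of how one highlight map could drive both operations follows. It assumes the map is a nested list indexed as `[frame][row][column]` with values in [0, 1], scores each frame by its mean value, picks the fixed‑length clip with the highest total score, and crops to the bounding box of pixels above a threshold. These rules, the 0.5 threshold, and the function names are illustrative assumptions rather than the team's production logic.

```python
def frame_scores(highlight_map):
    """Mean highlight value per frame (assumed scoring rule)."""
    return [
        sum(map(sum, frame)) / (len(frame) * len(frame[0]))
        for frame in highlight_map
    ]


def best_clip(scores, clip_len):
    """Return (start, end) of the clip_len-frame span with the highest total score."""
    if not 0 < clip_len <= len(scores):
        raise ValueError("clip_len must be between 1 and len(scores)")
    window_sum = sum(scores[:clip_len])
    best_sum, best_start = window_sum, 0
    for start in range(1, len(scores) - clip_len + 1):
        # Slide the window: add the entering frame, drop the leaving one.
        window_sum += scores[start + clip_len - 1] - scores[start - 1]
        if window_sum > best_sum:
            best_sum, best_start = window_sum, start
    return best_start, best_start + clip_len


def crop_box(highlight_map, start, end, threshold=0.5):
    """Bounding box (x0, y0, x1, y1) of pixels >= threshold in frames [start, end).

    x1 and y1 are exclusive. Returns None if no pixel passes the threshold.
    """
    xs, ys = [], []
    for frame in highlight_map[start:end]:
        for y, row in enumerate(frame):
            for x, value in enumerate(row):
                if value >= threshold:
                    xs.append(x)
                    ys.append(y)
    if not xs:
        return None
    return min(xs), min(ys), max(xs) + 1, max(ys) + 1


# Toy map: 6 frames of 4x4 pixels, with a highlight region in frames 2 and 3.
hm = [[[0.0] * 4 for _ in range(4)] for _ in range(6)]
for t in (2, 3):
    for y in (1, 2):
        for x in (2, 3):
            hm[t][y][x] = 0.9

start, end = best_clip(frame_scores(hm), clip_len=2)
print((start, end), crop_box(hm, start, end))
```

On this toy input the selected clip spans frames 2 and 3 and the crop box encloses the 2x2 highlighted region. A real pipeline would also need to respect target aspect ratios and smooth the box over time.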
In summary, PLD‑VHD offers a fine‑grained, interpretable solution for video highlight detection and sets the stage for future personalized models.
Alimama Tech
Official Alimama tech channel, showcasing all of Alimama's technical innovations.