Video Highlight Detection and GIF Cover Generation Using 3D Convolutional Scoring
The paper proposes a 3D‑CNN scoring system that ranks short, information‑dense video segments, selects the most exciting one, and converts it into a looping GIF cover to replace static thumbnails. Trained on large video‑GIF datasets with a pairwise ranking loss, the system improves click‑through rates while reducing bad‑case generation.
The paper addresses the problem of selecting the most informative and attractive video segment to replace static thumbnail images with animated GIFs on news platforms. Static thumbnails provide limited information, while automatically playing the beginning of a video often shows irrelevant or advertising content, harming user experience.
To solve this, the authors propose extracting the most exciting segment—defined as a short clip that independently conveys a significant event (e.g., a car crash, a goal)—and converting it into a looping GIF. The task is transformed into a scoring problem: each video segment receives a score, and higher scores indicate more suitable highlights.
Algorithm Idea
A video is divided into multiple temporally ordered but partially independent segments. Segments with strong independence and high information density are labeled as “exciting,” while transition or meaningless clips are labeled as “non‑exciting.” The goal is to train a model that assigns higher scores to exciting segments.
Data Collection
Two data sources are used: (1) an open‑source dataset (video2gifdatas) containing 120,000 video‑GIF pairs manually created by users, and (2) a manually annotated set where each video provides four GIFs labeled as good, average, or bad. Pairs of exciting and non‑exciting segments are constructed for training.
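The pairing step can be sketched as follows. This is a minimal illustration, not the paper's actual data pipeline: the segment identifiers and the all-pairs strategy are assumptions for demonstration.

```python
from itertools import product

def build_training_pairs(exciting, non_exciting):
    """Build (exciting, non-exciting) ranking pairs from labeled segments
    of the same video; every cross combination becomes one training pair.
    Segment identifiers here are hypothetical placeholders."""
    return list(product(exciting, non_exciting))

# Two exciting and two non-exciting segments yield 2 x 2 = 4 pairs.
pairs = build_training_pairs(["seg2", "seg5"], ["seg0", "seg1"])
```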
Model Training
The model scores video segments using a 3D convolutional neural network (3D‑CNN) to capture spatio‑temporal features, followed by fully‑connected layers that output a score in the range [0,1]. 3D‑CNNs (e.g., C3D, I3D, P3D, R(2+1)D) are compared, and I3D is found to perform best because it can be fine‑tuned from pretrained weights and preserves full temporal information.
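The scoring head described above can be sketched in NumPy. The 3D‑CNN backbone (e.g. I3D) is elided here; `features` stands in for its output volume, and the weights `w`, `b` are hypothetical, untrained parameters used only to show the pooling → fully‑connected → sigmoid path.

```python
import numpy as np

def score_segment(features, w, b):
    """Toy scoring head: pool spatio-temporal features, apply one
    fully-connected layer, and squash the result into [0, 1].
    `features` has shape (T, H, W, C); `w` has shape (C,)."""
    pooled = features.mean(axis=(0, 1, 2))   # global spatio-temporal pooling
    logit = float(pooled @ w + b)            # fully-connected layer
    return 1.0 / (1.0 + np.exp(-logit))      # sigmoid keeps the score in [0, 1]

rng = np.random.default_rng(0)
feat = rng.standard_normal((16, 7, 7, 32))   # 16 frames of 7x7x32 features
s = score_segment(feat, rng.standard_normal(32), 0.0)
```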
The training adopts a pairwise ranking loss: for each pair (exciting, non‑exciting), the loss penalizes cases where the exciting segment receives a lower score than the non‑exciting one. The loss function is illustrated in the original figures.
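A common margin-based form of this loss can be sketched as below; the paper's exact formulation is only shown in its figures, so the margin value and averaging here are assumptions.

```python
import numpy as np

def pairwise_ranking_loss(pos_scores, neg_scores, margin=0.5):
    """Margin-based pairwise ranking loss: penalizes pairs where the
    exciting segment's score does not exceed the non-exciting segment's
    score by at least `margin`. Zero loss once the margin is satisfied."""
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    return np.maximum(0.0, margin - (pos - neg)).mean()

# First pair is ranked with a wide margin (no loss); the second is too
# close (0.8 vs 0.7), so it contributes 0.5 - 0.1 = 0.4.
loss = pairwise_ranking_loss([0.9, 0.8], [0.2, 0.7])
```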
Model Prediction
During inference, videos are first segmented using a scene‑segmentation algorithm, and 16 uniformly sampled frames per segment are fed to the model. The segment with the highest score is selected and rendered as a GIF cover. The end‑to‑end pipeline is shown in Figure 4 of the source.
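The frame sampling and segment selection steps can be sketched as follows; the helper names are hypothetical, and scene segmentation itself is out of scope here.

```python
def sample_frame_indices(n_frames, n_samples=16):
    """Uniformly sample `n_samples` frame indices from a segment
    containing `n_frames` frames (the paper uses 16 per segment)."""
    step = n_frames / n_samples
    return [min(int(i * step), n_frames - 1) for i in range(n_samples)]

def pick_highlight(segment_scores):
    """Return the index of the highest-scoring segment."""
    return max(range(len(segment_scores)), key=segment_scores.__getitem__)

frames = sample_frame_indices(64)            # every 4th frame: 0, 4, ..., 60
idx = pick_highlight([0.12, 0.87, 0.45])     # -> 1
```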
Evaluation Metrics
Three metrics are reported: nMSD (normalized mean segment distance, lower is better), nACC (pairwise accuracy, higher is better), and bad‑case rate (percentage of generated GIFs that are meaningless or disruptive, lower is better). The model achieves nMSD = 45.4 %, nACC = 58.5 %, and a bad‑case rate of 11 %.
Online Performance
The GIF generation algorithm was deployed in an A/B test on NetEase News. Click‑through rate (CTR) of GIF covers significantly outperformed static covers, though average reading time per article decreased because users could grasp video content from the GIF without watching the full video.
Future Work
Improvements are suggested for video segmentation (e.g., using advanced action‑recognition methods) and feature extraction (e.g., two‑stream networks) to further enhance GIF quality.
NetEase Media Technology Team