
Video Highlight Detection and GIF Cover Generation Using 3D Convolutional Scoring

The paper proposes a 3D‑CNN scoring system that ranks short, information‑dense video segments, selects the most exciting one, and converts it into a looping GIF cover to replace static thumbnails. Trained on large video‑GIF datasets with a pairwise ranking loss, the system improves click‑through rates while reducing bad‑case generation.

NetEase Media Technology Team

The paper addresses the problem of selecting the most informative and attractive video segment to replace static thumbnail images with animated GIFs on news platforms. Static thumbnails provide limited information, while automatically playing the beginning of a video often shows irrelevant or advertising content, harming user experience.

To solve this, the authors propose extracting the most exciting segment—defined as a short clip that independently conveys a significant event (e.g., a car crash, a goal)—and converting it into a looping GIF. The task is transformed into a scoring problem: each video segment receives a score, and higher scores indicate more suitable highlights.

Algorithm Idea

A video is divided into multiple temporally ordered but partially independent segments. Segments with strong independence and high information density are labeled as “exciting,” while transition or meaningless clips are labeled as “non‑exciting.” The goal is to train a model that assigns higher scores to exciting segments.

Data Collection

Two data sources are used: (1) the open‑source Video2GIF dataset, containing 120,000 video‑GIF pairs manually created by users, and (2) a manually annotated set in which each video provides four GIFs labeled good, average, or bad. From these, pairs of exciting and non‑exciting segments are constructed for training.
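As a hedged sketch of the pairing step (the paper does not give code for this; the function name and labels below are illustrative), each exciting segment can simply be paired with each non‑exciting segment from the same video:

```python
from itertools import product

def build_training_pairs(exciting, non_exciting):
    """Pair every exciting segment with every non-exciting segment,
    yielding (positive, negative) tuples for the pairwise ranking loss."""
    return list(product(exciting, non_exciting))

# Illustrative segment labels; in practice these would be segment IDs or clips.
pairs = build_training_pairs(["goal", "crash"], ["transition", "intro"])
```

Each resulting pair trains the model to score its first element above its second.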

Model Training

The model scores video segments using a 3D convolutional neural network (3D‑CNN) to capture spatio‑temporal features, followed by fully‑connected layers that output a score in the range [0,1]. 3D‑CNNs (e.g., C3D, I3D, P3D, R(2+1)D) are compared, and I3D is found to perform best because it can be fine‑tuned from pretrained weights and preserves full temporal information.
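The final fully‑connected stage can be illustrated with a minimal sketch: a pooled spatio‑temporal feature vector (as produced by the 3D‑CNN backbone) is mapped through a linear layer and a sigmoid to a score in [0, 1]. The weights and inputs below are illustrative, not taken from the paper:

```python
import math

def score_segment(features, weights, bias):
    """Map a pooled spatio-temporal feature vector to a score in (0, 1):
    a single linear layer followed by a sigmoid, standing in for the
    fully-connected head described in the paper."""
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))
```

In the real model this head sits on top of I3D features and is trained jointly with the pairwise ranking loss.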

The training adopts a pairwise ranking loss: for each pair (exciting, non‑exciting), the loss penalizes cases where the exciting segment receives a lower score than the non‑exciting one. The loss function is illustrated in the original figures.
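A common form of such a penalty is the margin (hinge) ranking loss; the sketch below assumes this form with an illustrative margin, since the paper's exact formula is only shown in its figures:

```python
def ranking_loss(score_pos, score_neg, margin=0.5):
    """Hinge-style pairwise ranking loss: zero once the exciting segment
    outscores the non-exciting one by at least `margin`. The margin is a
    hyperparameter; 0.5 here is illustrative, not the paper's value."""
    return max(0.0, margin - (score_pos - score_neg))
```

When the exciting segment scores well above the non‑exciting one, the loss vanishes; when the ordering is wrong, the loss grows with the score gap.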

Model Prediction

During inference, videos are first segmented using a scene‑segmentation algorithm, and 16 uniformly sampled frames per segment are fed to the model. The segment with the highest score is selected and rendered as a GIF cover. The end‑to‑end pipeline is shown in Figure 4 of the source.
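The selection step can be sketched as follows; `scorer` stands in for the trained I3D model, and scene segmentation is assumed to have already produced the `(start, end)` frame ranges:

```python
def sample_frame_indices(start, end, n=16):
    """Uniformly sample n frame indices from the half-open range [start, end),
    mirroring the 16-frame-per-segment input described above."""
    step = (end - start) / n
    return [start + int(i * step) for i in range(n)]

def pick_highlight(segments, scorer):
    """Score each (start, end) segment on its sampled frames and
    return the highest-scoring segment for rendering as the GIF cover."""
    return max(segments, key=lambda seg: scorer(sample_frame_indices(*seg)))
```

For example, `pick_highlight([(0, 160), (160, 480)], scorer)` would return whichever segment the model scores higher.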

Evaluation Metrics

Three metrics are reported: nMSD (normalized meaningful summary duration, lower is better), nACC (pairwise ranking accuracy, higher is better), and bad‑case rate (the percentage of generated GIFs that are meaningless or disruptive, lower is better). The model achieves nMSD = 45.4%, nACC = 58.5%, and a bad‑case rate of 11%.
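Of these, pairwise accuracy and bad‑case rate have straightforward definitions, sketched below (nMSD is omitted, since its exact normalization is dataset‑specific); the function names are illustrative:

```python
def pairwise_accuracy(pairs, scorer):
    """Fraction of (exciting, non-exciting) pairs the model orders
    correctly, i.e. where the exciting segment gets the higher score."""
    correct = sum(1 for pos, neg in pairs if scorer(pos) > scorer(neg))
    return correct / len(pairs)

def bad_case_rate(labels):
    """Share of generated GIFs annotated as meaningless or disruptive."""
    return labels.count("bad") / len(labels)
```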

Online Performance

The GIF generation algorithm was deployed in an A/B test on NetEase News. Click‑through rate (CTR) of GIF covers significantly outperformed static covers, though average reading time per article decreased because users could grasp video content from the GIF without watching the full video.

Future Work

Improvements are suggested for video segmentation (e.g., using advanced action‑recognition methods) and feature extraction (e.g., two‑stream networks) to further enhance GIF quality.

Tags: Deep Learning, Content Recommendation, Video Segmentation, Video Summarization, 3D CNN, GIF Generation, Ranking Loss