Video Highlight Analysis Technology Framework
iQIYI’s video highlight analysis framework combines a large supervised dataset, deep label distribution learning, multi‑task training with a canonical‑correlated autoencoder, and a weakly supervised ranking model enhanced by confidence weighting and graph convolution, then fuses these signals to improve highlight detection accuracy.
With the explosive growth of video content, especially short‑video platforms, selecting attractive video segments to improve user experience has become a key research problem. This document describes a comprehensive video highlight analysis solution developed for iQIYI, covering data collection, supervised and weakly supervised modeling, multi‑task learning, and the integration of additional contextual signals.
The supervised pipeline is built on a large annotated dataset of over 500,000 10‑second clips from more than 5,000 movies, TV series, and variety shows. Each clip is scored on a 0‑10 highlight scale and labeled with multi‑dimensional tags (scene, behavior, emotion, dialogue, etc.). To handle the dataset’s size and label noise, the authors employ transfer learning with 3D CNNs (I3D) pretrained on Kinetics‑400/600, fuse visual and audio (VGGish) features, and adopt a Deep Label Distribution Learning (DLDL) loss that models the score as a Gaussian distribution rather than a point estimate.
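The DLDL idea can be sketched in a few lines: the scalar 0‑10 annotation is spread into a discrete Gaussian over score bins, and the network is trained to match that distribution (typically with a KL‑divergence loss). The bin count and sigma below are illustrative choices, not values from the article:

```python
import numpy as np

BINS = np.linspace(0, 10, 51)  # discretized 0-10 score axis (bin count is illustrative)

def score_to_distribution(score, sigma=1.0):
    """Turn a scalar highlight score into a discrete Gaussian label distribution."""
    p = np.exp(-(BINS - score) ** 2 / (2 * sigma ** 2))
    return p / p.sum()

def dldl_loss(pred_dist, target_dist, eps=1e-12):
    """KL divergence from the predicted to the target distribution (DLDL objective)."""
    return float(np.sum(target_dist * (np.log(target_dist + eps) - np.log(pred_dist + eps))))

target = score_to_distribution(7.0)   # annotator gave the clip a 7
pred = score_to_distribution(6.5)     # stand-in for a network's softmax output
loss = dldl_loss(pred, target)
```

Compared with regressing a single number, the soft target tolerates annotator disagreement: a prediction one bin away from the peak is penalized far less than one at the opposite end of the scale.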
For multi‑label tag classification, a Canonical‑Correlated Autoencoder (C2AE) is used to learn label embeddings that capture correlations (e.g., “funny” ↔ “laugh”). The loss combines binary cross‑entropy with a correlation loss, improving mAP by 1.1 %.
A multi‑task learning architecture jointly trains the DLDL scoring network and the C2AE tag classifier on top of a shared feature extractor (Fs). Sharing the backbone reduces model parameters by roughly 50 % and lowers mean squared error by 0.10.
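Structurally, the multi‑task setup is one shared trunk feeding two heads. The sketch below uses random linear maps and illustrative dimensions (512‑d fused features, 128‑d shared space, 51 score bins, 20 tags) just to show the wiring:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fs (shared) and two task heads, as random linear maps for illustration.
W_shared = rng.normal(size=(512, 128))  # Fs: fused audio/visual features -> shared space
W_score = rng.normal(size=(128, 51))    # head 1: DLDL score distribution
W_tags = rng.normal(size=(128, 20))     # head 2: multi-label tag probabilities

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(features):
    shared = np.tanh(features @ W_shared)  # Fs output, reused by both tasks
    return softmax(shared @ W_score), sigmoid(shared @ W_tags)

score_dist, tag_probs = forward(rng.normal(size=512))
```

The parameter saving follows directly from the topology: the heavy backbone exists once instead of once per task, so only the lightweight heads are task‑specific.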
To reduce annotation cost, a weakly supervised model leverages user‑generated clip‑sharing behavior as an implicit highlight signal. A ranking loss is applied to pairs of clips with high vs. low share counts. Because the weak labels are noisy, the authors introduce confidence weighting and a graph‑convolutional network (GCN) that aggregates features of similar clips, making the model more robust to noisy pairs.
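Both robustness mechanisms are simple to sketch. Below, a hinge‑style pairwise ranking loss is scaled by a per‑pair confidence weight, and one mean‑aggregation GCN step smooths each clip's features with its neighbors'. The margin value and mean aggregation are illustrative stand‑ins for the article's exact choices:

```python
import numpy as np

def weighted_ranking_loss(score_hi, score_lo, confidence, margin=1.0):
    """Pairwise hinge loss: the frequently-shared clip should outscore the
    rarely-shared one by `margin`; noisy pairs get a low confidence weight."""
    per_pair = np.maximum(0.0, margin - (score_hi - score_lo))
    return float(np.mean(confidence * per_pair))

def gcn_aggregate(features, adjacency):
    """One mean-aggregation step over a clip-similarity graph (self-loops
    assumed in `adjacency`), so similar clips share evidence."""
    degree = adjacency.sum(axis=1, keepdims=True)
    return adjacency @ features / np.maximum(degree, 1)
```

A pair whose share‑count gap is small (hence likely spurious) can be assigned confidence near zero, so it barely moves the model, while the GCN smoothing keeps an individual mislabeled clip from dominating its own representation.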
The final system fuses scores from supervised, weakly supervised, celebrity importance, and user interaction features (e.g., fast‑forward/rewind events) to produce a refined highlight score. This fusion improves the classification accuracy of high‑quality segments by about 2 % in production.
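At its simplest, such a fusion is a weighted combination of the per‑signal scores. The weights below are invented for illustration; in production they would be tuned against held‑out engagement data:

```python
# Illustrative fusion weights per signal -- not the production values.
WEIGHTS = {
    "supervised": 0.50,   # DLDL score from the annotated dataset
    "weak": 0.20,         # weakly supervised ranking score
    "celebrity": 0.15,    # celebrity-importance signal
    "interaction": 0.15,  # user interactions (fast-forward/rewind events)
}

def fuse_scores(signals):
    """Weighted linear fusion of per-signal highlight scores, each in [0, 1]."""
    return sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)

final = fuse_scores(
    {"supervised": 0.8, "weak": 0.6, "celebrity": 0.4, "interaction": 0.7}
)
```

A linear fusion keeps the system debuggable: each signal's contribution to a clip's final score can be inspected and re‑weighted independently.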
Future work includes adding textual features, exploring semi‑supervised joint training of labeled and weakly labeled data, and incorporating additional AI models (object detection, scene classification, speech recognition) to further refine highlight scoring.
iQIYI Technical Product Team