Video Highlight Analysis Technology Framework
iQIYI’s video highlight analysis framework combines a large supervised dataset, deep label distribution learning, multi‑task training with a canonical‑correlated autoencoder, and a weakly supervised ranking model enhanced by confidence weighting and graph convolution, then fuses these signals to improve highlight detection accuracy.
With the explosive growth of video content, especially short‑video platforms, selecting attractive video segments to improve user experience has become a key research problem. This document describes a comprehensive video highlight analysis solution developed for iQIYI, covering data collection, supervised and weakly supervised modeling, multi‑task learning, and the integration of additional contextual signals.
The supervised pipeline is built on a large annotated dataset of over 500,000 10‑second clips from more than 5,000 movies, TV series, and variety shows. Each clip is scored on a 0‑10 highlight scale and labeled with multi‑dimensional tags (scene, behavior, emotion, dialogue, etc.). To handle the dataset’s size and label noise, the authors employ transfer learning with 3D CNNs (I3D) pretrained on Kinetics‑400/600, fuse visual and audio (VGGish) features, and adopt a Deep Label Distribution Learning (DLDL) loss that models the score as a Gaussian distribution rather than a point estimate.
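The DLDL idea can be sketched in a few lines: the scalar 0‑10 annotation is spread into a discrete Gaussian over score bins, and the network is trained to match that distribution (typically with a KL‑divergence loss). The bin count and sigma below are illustrative choices, not values from the article:

```python
import numpy as np

BINS = np.linspace(0, 10, 51)  # discretized 0-10 score axis (bin count is illustrative)

def score_to_distribution(score, sigma=1.0):
    """Turn a scalar highlight score into a discrete Gaussian label distribution."""
    p = np.exp(-(BINS - score) ** 2 / (2 * sigma ** 2))
    return p / p.sum()

def dldl_loss(pred_dist, target_dist, eps=1e-12):
    """KL divergence from the predicted to the target distribution (DLDL objective)."""
    return float(np.sum(target_dist * (np.log(target_dist + eps) - np.log(pred_dist + eps))))

target = score_to_distribution(7.0)   # annotator gave the clip a 7
pred = score_to_distribution(6.5)     # stand-in for a network's softmax output
loss = dldl_loss(pred, target)
```

Compared with regressing a single number, the soft target tolerates annotator disagreement: a prediction one bin away from the peak is penalized far less than one at the opposite end of the scale.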
For multi‑label tag classification, a Canonical‑Correlated Autoencoder (C2AE) is used to learn label embeddings that capture correlations (e.g., “funny” ↔ “laugh”). The loss combines binary cross‑entropy with a correlation loss, improving mAP by 1.1 %.
A multi‑task learning architecture jointly trains the DLDL scoring network and the C2AE tag classifier on top of a shared feature extractor (Fs). Sharing the backbone reduces model parameters by roughly 50 % and lowers mean squared error by 0.10.
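Structurally, the multi‑task setup is one shared trunk feeding two heads. The sketch below uses random linear maps and illustrative dimensions (512‑d fused features, 128‑d shared space, 51 score bins, 20 tags) just to show the wiring:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fs (shared) and two task heads, as random linear maps for illustration.
W_shared = rng.normal(size=(512, 128))  # Fs: fused audio/visual features -> shared space
W_score = rng.normal(size=(128, 51))    # head 1: DLDL score distribution
W_tags = rng.normal(size=(128, 20))     # head 2: multi-label tag probabilities

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(features):
    shared = np.tanh(features @ W_shared)  # Fs output, reused by both tasks
    return softmax(shared @ W_score), sigmoid(shared @ W_tags)

score_dist, tag_probs = forward(rng.normal(size=512))
```

The parameter saving follows directly from the topology: the heavy backbone exists once instead of once per task, so only the lightweight heads are task‑specific.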
To reduce annotation cost, a weakly supervised model leverages user‑generated clip‑sharing behavior as an implicit highlight signal. A ranking loss is applied to pairs of clips with high vs. low share counts. Because the weak labels are noisy, the authors introduce confidence weighting and a graph‑convolutional network (GCN) that aggregates features of similar clips, making the model more robust to noisy pairs.
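Both robustness mechanisms are simple to sketch. Below, a hinge‑style pairwise ranking loss is scaled by a per‑pair confidence weight, and one mean‑aggregation GCN step smooths each clip's features with its neighbors'. The margin value and mean aggregation are illustrative stand‑ins for the article's exact choices:

```python
import numpy as np

def weighted_ranking_loss(score_hi, score_lo, confidence, margin=1.0):
    """Pairwise hinge loss: the frequently-shared clip should outscore the
    rarely-shared one by `margin`; noisy pairs get a low confidence weight."""
    per_pair = np.maximum(0.0, margin - (score_hi - score_lo))
    return float(np.mean(confidence * per_pair))

def gcn_aggregate(features, adjacency):
    """One mean-aggregation step over a clip-similarity graph (self-loops
    assumed in `adjacency`), so similar clips share evidence."""
    degree = adjacency.sum(axis=1, keepdims=True)
    return adjacency @ features / np.maximum(degree, 1)
```

A pair whose share‑count gap is small (hence likely spurious) can be assigned confidence near zero, so it barely moves the model, while the GCN smoothing keeps an individual mislabeled clip from dominating its own representation.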
The final system fuses scores from supervised, weakly supervised, celebrity importance, and user interaction features (e.g., fast‑forward/rewind events) to produce a refined highlight score. This fusion improves the classification accuracy of high‑quality segments by about 2 % in production.
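At its simplest, such a fusion is a weighted combination of the per‑signal scores. The weights below are invented for illustration; in production they would be tuned against held‑out engagement data:

```python
# Illustrative fusion weights per signal -- not the production values.
WEIGHTS = {
    "supervised": 0.50,   # DLDL score from the annotated dataset
    "weak": 0.20,         # weakly supervised ranking score
    "celebrity": 0.15,    # celebrity-importance signal
    "interaction": 0.15,  # user interactions (fast-forward/rewind events)
}

def fuse_scores(signals):
    """Weighted linear fusion of per-signal highlight scores, each in [0, 1]."""
    return sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)

final = fuse_scores(
    {"supervised": 0.8, "weak": 0.6, "celebrity": 0.4, "interaction": 0.7}
)
```

A linear fusion keeps the system debuggable: each signal's contribution to a clip's final score can be inspected and re‑weighted independently.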
Future work includes adding textual features, exploring semi‑supervised joint training of labeled and weakly labeled data, and incorporating additional AI models (object detection, scene classification, speech recognition) to further refine highlight scoring.
iQIYI Technical Product Team