
Multimodal Video High‑Energy Segment Extraction for Dynamic Video Covers

The authors present a multimodal system that automatically extracts high‑energy video segments for dynamic covers by analyzing subtitles, audio, visual frames, and danmu, employing LLM prompt‑tuning, scene‑cut detection, and aesthetic scoring to reduce manual effort and boost click‑through rates.

Bilibili Tech

When users browse Bilibili, video thumbnails are usually static images extracted from a single frame. Static covers are simple but limited in conveying the video’s dynamic content and emotional tone.

Dynamic video covers, which play short video clips, are more attractive, provide a realistic preview, and better convey story emotions, leading to higher click‑through rates. However, creating them currently relies on manual effort or on large amounts of user interaction data, which restricts their applicability.

To address this, the authors propose a multimodal high‑energy point extraction technology. The system automatically analyzes multiple modalities of a video—subtitles, audio, visual frames, and user danmu (real‑time comments overlaid on the video)—to locate several high‑energy moments and select the most suitable clip for a dynamic cover, with minimal human intervention.

The technical pipeline consists of:

Subtitle acquisition: use external subtitles when available; otherwise apply OCR or ASR to extract subtitles from the video.

LLM candidate generation: pre‑process the subtitles and feed them into a large language model (LLM) to generate a set of candidate high‑energy subtitle segments with their timestamps.

Candidate clip selection: evaluate the clips with aesthetic scoring, dynamic scene analysis (using a Scenecut‑based algorithm), and high‑energy danmu information from the database.

Final clip generation: optionally generate textual summaries for the selected segment.
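As a rough illustration, the candidate‑selection stage above might fuse its three signals into a single ranking score. The field names and weights here are hypothetical, not taken from the original system:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    start: float          # clip start time (s)
    end: float            # clip end time (s)
    scene_cuts: int       # scene-cut count within the clip
    aesthetic: float      # mean aesthetic score, 0-10
    danmu_count: int      # danmu posted during the clip

def rank_candidates(cands, w_cuts=0.3, w_aes=0.4, w_danmu=0.3):
    """Rank candidate clips by a weighted sum of normalized signals.
    Weights are illustrative, not from the article."""
    if not cands:
        return []
    max_cuts = max(c.scene_cuts for c in cands) or 1
    max_danmu = max(c.danmu_count for c in cands) or 1
    def score(c):
        return (w_cuts * c.scene_cuts / max_cuts
                + w_aes * c.aesthetic / 10.0
                + w_danmu * c.danmu_count / max_danmu)
    return sorted(cands, key=score, reverse=True)
```

In practice each signal would come from its own module (scene analysis, the aesthetic model, danmu statistics); a linear blend is just the simplest way to combine them.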

Key technologies include:

Subtitle extraction module : OCR pipeline (frame extraction → text detection → recognition → post‑processing → timestamp alignment) and ASR pipeline (audio extraction → preprocessing → speech recognition → post‑processing → timestamp alignment).
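The final step shared by both pipelines, timestamp alignment, can be sketched as merging word‑level recognition results into subtitle lines. The gap and length thresholds below are illustrative assumptions, not values from the article:

```python
def align_words_to_subtitles(words, max_gap=0.6, max_len=40):
    """Merge word-level ASR output (word, start, end) into subtitle lines.
    A new line starts when the silence gap exceeds max_gap seconds or the
    line would grow beyond max_len characters."""
    subtitles = []
    line, line_start, line_end = [], None, None
    for word, start, end in words:
        if line and (start - line_end > max_gap
                     or len(" ".join(line)) + len(word) + 1 > max_len):
            subtitles.append((" ".join(line), line_start, line_end))
            line, line_start = [], None
        if line_start is None:
            line_start = start
        line.append(word)
        line_end = end
    if line:
        subtitles.append((" ".join(line), line_start, line_end))
    return subtitles
```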

LLM adaptation : Instead of full fine‑tuning, the authors employ P‑tuning (prompt tuning) where the LLM’s parameters are frozen and only the prompt encoder is optimized, drastically reducing required parameter updates.
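The core idea of P‑tuning can be illustrated with a toy numpy example that deliberately ignores real transformer internals: the "model" weights stay frozen, and gradient descent updates only a soft‑prompt vector that conditions the model's input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "LLM": stand-in for the pretrained model -- a fixed linear scorer.
W_frozen = rng.normal(size=(4,))
W_before = W_frozen.copy()                # kept only to verify nothing changed

# Trainable soft prompt: the only parameters P-tuning optimizes.
prompt = np.zeros(4)

def model(prompt_vec, x):
    # The soft prompt conditions the frozen model (here: added to the
    # input embedding before the frozen linear map).
    return W_frozen @ (x + prompt_vec)

# Toy objective: steer the frozen model's output toward a target score
# by gradient descent on the prompt alone.
x, target, lr = rng.normal(size=(4,)), 3.0, 0.02
for _ in range(500):
    err = model(prompt, x) - target
    prompt -= lr * 2 * err * W_frozen     # gradient w.r.t. the prompt only
```

In a real system the prompt encoder produces continuous prompt embeddings prepended to the token sequence, but the parameter‑economy argument is the same: the update touches only the prompt, never the LLM.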

Scene dynamic analysis : Count scene cuts using a threshold‑based Scenecut algorithm; higher cut frequency indicates more dynamic content.
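A minimal version of threshold‑based cut counting compares consecutive frames and counts large jumps; the threshold and the plain pixel‑difference metric are simplifying assumptions (production scenecut detectors typically also use histogram or edge‑change metrics):

```python
import numpy as np

def count_scene_cuts(frames, threshold=0.3):
    """Count hard cuts by thresholding the mean absolute pixel difference
    between consecutive frames (pixel values assumed in [0, 1])."""
    cuts = 0
    for prev, cur in zip(frames, frames[1:]):
        if np.abs(cur.astype(float) - prev.astype(float)).mean() > threshold:
            cuts += 1
    return cuts
```

A clip with a higher cut count is treated as more dynamic and therefore a better dynamic‑cover candidate.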

Aesthetic scoring : Sample key frames from each clip, score them with a trained aesthetic model (0–10), and discard clips with average scores below 4.
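The filtering rule above could look like the following sketch, where `score_frame` is a stand‑in for the trained aesthetic model (its interface here is an assumption):

```python
def filter_by_aesthetics(clips, score_frame, n_samples=5, min_avg=4.0):
    """Keep clips whose sampled key frames average at least min_avg on a
    0-10 aesthetic scale; discard the rest, as described in the article."""
    kept = []
    for clip in clips:
        step = max(1, len(clip) // n_samples)
        samples = clip[::step][:n_samples]
        avg = sum(score_frame(f) for f in samples) / len(samples)
        if avg >= min_avg:
            kept.append(clip)
    return kept
```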

High‑energy danmu : Leverage user‑generated danmu data to identify popular moments, especially for high‑traffic videos.
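One simple way to turn danmu data into a high‑energy signal is to find the densest fixed‑length window of comment timestamps; the sliding‑window approach and the 5‑second window below are illustrative assumptions:

```python
def peak_danmu_window(timestamps, window=5.0):
    """Slide a fixed-length window over sorted danmu timestamps and return
    (window_start, count) for the densest window -- a simple proxy for a
    'high-energy' moment in high-traffic videos."""
    ts = sorted(timestamps)
    best_start, best_count, left = 0.0, 0, 0
    for right, t in enumerate(ts):
        while t - ts[left] > window:
            left += 1
        count = right - left + 1
        if count > best_count:
            best_count, best_start = count, ts[left]
    return best_start, best_count
```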

The system’s effectiveness is demonstrated with three example clips from popular videos (e.g., "Wandering Earth", "Amazing Animals", "Life a String"), showing that the automatically selected segments are both visually engaging and contextually representative.

In summary, the multimodal approach enables fast, accurate extraction of high‑energy video segments for dynamic covers, reducing production cost and improving user experience. The authors also envision broader applications such as video chapter segmentation for long‑form content and automatic outline generation for live‑commerce streams.

Tags: multimodal AI, OCR, large language model, ASR, dynamic cover, scene analysis, video summarization
Written by

Bilibili Tech

Provides introductions and tutorials on Bilibili-related technologies.
