How AI Generates Synchronized Video Narrations for E‑Commerce

This article presents the research behind Synchronized Video Storytelling, introducing the E‑SyncVidStory dataset, the VideoNarrator multimodal architecture, and extensive experiments that demonstrate high‑quality, product‑aware video narration generation for e‑commerce applications.

Alimama Tech

Background

In the short‑video era, manual editing is time‑consuming and relies on expert experience. Alibaba’s Intelligent Creation team explored automated video editing by integrating video understanding, structure analysis, and narration generation to produce coherent, product‑focused videos that improve viewer engagement.

Task Definition

Synchronized Video Storytelling requires generating a product‑related narration given product information, video footage, and a predefined storyline. The narration must be rich, coherent, and aligned with visual content.

Dataset (E‑SyncVidStory)

Data Collection

High‑click‑rate advertising videos were selected. For each video, product metadata (name, ingredients, usage, etc.) was extracted, videos were segmented into event‑based clips, and automatic speech recognition (ASR) provided raw subtitles. GPT‑4 corrected ASR errors and assigned script tags, followed by crowdsourced verification.

Statistics and Analysis

The dataset contains 6,032 videos and 41,292 clips covering categories such as personal care, fashion, home goods, maternity, and electronics. Clip lengths range from 2 s to 220 s (average 39 s) with an average transcript length of 194 words. A large variance in clip count per video motivated the use of relative‑position embeddings.

Methodology

Pre‑processing

Product information is parsed to extract key attributes. Video streams are analyzed with event detection and transition detection to produce coherent clip sequences.
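The article does not publish the team's transition detector, but the idea can be sketched with a simple color-histogram cut detector. Everything here (function name, bin count, distance measure, default threshold) is an illustrative assumption, not the paper's implementation:

```python
import numpy as np

def detect_transitions(frames: np.ndarray, threshold: float = 0.5) -> list[int]:
    """Return frame indices where a hard cut likely occurs.

    frames: array of shape (T, H, W, 3), uint8 video frames.
    A cut is flagged when the total-variation distance between the
    normalized color histograms of consecutive frames exceeds
    `threshold` (hypothetical criterion for illustration).
    """
    cuts = []
    prev_hist = None
    for t, frame in enumerate(frames):
        # 8-bin histogram per color channel, normalized to sum to 1
        hist = np.concatenate([
            np.histogram(frame[..., c], bins=8, range=(0, 256))[0]
            for c in range(3)
        ]).astype(float)
        hist /= hist.sum()
        if prev_hist is not None:
            # total-variation distance lies in [0, 1]
            dist = 0.5 * np.abs(hist - prev_hist).sum()
            if dist > threshold:
                cuts.append(t)
        prev_hist = hist
    return cuts
```

Detected cut indices would then delimit the event-based clip sequences fed to the rest of the pipeline.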

VideoNarrator Architecture

The system consists of four main modules:

Video Input: Frames from each clip are converted to image embeddings and concatenated across clips.

Feature Extraction: Visual tokens are projected together with relative‑position embeddings via a Video Projector, producing a unified video embedding.

Memory Integration: For long videos, earlier visual information is compressed into fixed‑length memory tokens to keep context manageable.

LLM Prompting: A large language model receives visual embeddings, product knowledge, and script tags to generate the narration, guided by prompts that restrict the model to use only provided information.
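The memory-integration step can be sketched as follows. The article only states that earlier visual information is compressed into fixed-length memory tokens; chunked mean pooling is one plausible reduction, used here purely as an assumption:

```python
import numpy as np

def compress_to_memory(past_tokens: np.ndarray, num_memory_tokens: int = 16) -> np.ndarray:
    """Compress earlier visual tokens into a fixed-length memory.

    past_tokens: (T, D) visual tokens from already-narrated clips.
    Returns at most (num_memory_tokens, D): each memory token is the
    mean of one contiguous chunk, so the context cost stays constant
    no matter how long the video grows. (Mean pooling is an assumed
    reduction; the paper does not specify the compression operator.)
    """
    if past_tokens.shape[0] <= num_memory_tokens:
        return past_tokens  # already short enough; keep tokens as-is
    chunks = np.array_split(past_tokens, num_memory_tokens, axis=0)
    return np.stack([chunk.mean(axis=0) for chunk in chunks])
```

Whatever the exact operator, the key property is that the memory footprint is bounded by `num_memory_tokens` regardless of video length.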

Detailed Design

Visual Encoder: CLIP ViT‑L/14 encodes selected frames.

Visual Projection: BLIP‑2’s Q‑Former extracts aligned multimodal features.

Video Token Concatenation: Previous clip embeddings are prepended to the current clip to preserve temporal context.

Video Compression: Adjacent frames with similarity above a threshold τ are removed iteratively.
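A minimal sketch of this deduplication, assuming a greedy pass that compares each frame embedding against the last kept one (the article specifies only iterative removal above a threshold τ, so the exact scan order is an assumption):

```python
import numpy as np

def compress_frames(embs: np.ndarray, tau: float = 0.95) -> np.ndarray:
    """Drop near-duplicate adjacent frame embeddings.

    embs: (T, D) frame embeddings. A frame is dropped when its cosine
    similarity with the most recently kept frame exceeds tau, so runs
    of visually static frames collapse to a single representative.
    """
    normed = embs / np.linalg.norm(embs, axis=-1, keepdims=True)
    kept = [0]  # always keep the first frame
    for t in range(1, embs.shape[0]):
        if normed[kept[-1]] @ normed[t] <= tau:
            kept.append(t)
    return embs[kept]
```

Raising τ keeps more frames (stricter notion of "duplicate"); lowering it compresses more aggressively.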

Position Embedding: Each clip receives a relative position p% (its location within the full video), which is embedded and added to the clip's visual tokens.
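One way to realize this, as a hedged sketch: bucket the percentage into a small lookup table and broadcast-add the resulting vector to the clip's tokens. The bucket count, table size, and random initialization are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

class RelativePositionEmbedding:
    """Embed a clip's relative position p% and add it to its visual tokens.

    Positions are bucketed into `num_buckets` bins so videos with very
    different clip counts share one small embedding table (hypothetical
    bucketing scheme; the article only specifies a relative position p%).
    """

    def __init__(self, dim: int, num_buckets: int = 100):
        self.num_buckets = num_buckets
        # stands in for a learned nn.Embedding-style table
        self.table = rng.normal(scale=0.02, size=(num_buckets, dim))

    def __call__(self, tokens: np.ndarray, clip_idx: int, num_clips: int) -> np.ndarray:
        # relative position p in [0, 1) mapped to a bucket index
        p = clip_idx / num_clips
        bucket = min(int(p * self.num_buckets), self.num_buckets - 1)
        # broadcast the position vector over all tokens of the clip
        return tokens + self.table[bucket]
```

Because the embedding depends on p% rather than an absolute clip index, it stays meaningful despite the large variance in clips per video noted in the dataset statistics.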

LLM Prompting: Prompts emphasize using product‑specific knowledge and discourage hallucinations.
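An illustrative template (not the paper's actual prompt) showing how product knowledge, the script tag, and a visual-embedding placeholder might be combined while restricting the model to provided facts; the field names and sample product are invented for the example:

```python
# Hypothetical prompt structure; the real system injects visual
# embeddings directly rather than as text, and its wording is unknown.
PROMPT_TEMPLATE = """\
You are writing a short advertising narration for one video clip.
Product knowledge (use ONLY these facts; do not invent attributes):
{product_info}
Script tag for this clip: {script_tag}
Visual content: {video_placeholder}
Write one coherent sentence of narration that matches the visuals,
continues the previous narration, and fits the script tag."""

prompt = PROMPT_TEMPLATE.format(
    product_info="Name: Glow Serum; Ingredients: niacinamide, hyaluronic acid",
    script_tag="product introduction",
    video_placeholder="<video_embeddings>",
)
```

The explicit "use ONLY these facts" constraint is the textual counterpart of the anti-hallucination restriction described above.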

Experiments

Comparison with Baselines

VideoNarrator was compared against multimodal pipelines (LLaVA + GPT‑3.5) and end‑to‑end MLLMs (Video‑ChatGPT, Video‑LLaVA, VTimeLLM) in both zero‑shot/few‑shot and fine‑tuned settings. Quantitative results show superior performance across automatic metrics.

Human Evaluation

Three criteria—visual relevance, attractiveness, and coherence—were assessed by human judges. VideoNarrator consistently outperformed GPT‑4V + GPT‑4 on all metrics.

General‑Domain Evaluation

To verify cross‑domain applicability, the model was tested on a generic video storytelling dataset after minor adjustments. It achieved state‑of‑the‑art results, confirming its versatility beyond e‑commerce.

Qualitative Results

Sample outputs demonstrate that the generated narration aligns tightly with visual content and follows the product‑centric storyline, matching the style of the E‑SyncVidStory annotations.

Conclusion and Future Work

The proposed VideoNarrator enables automated, high‑quality video narration for e‑commerce, facilitating faster video production and richer ad experiences. Remaining challenges include mitigating hallucinations and improving instruction following. Future work will explore stronger constraints and more sophisticated video understanding modules to further enhance narration quality.

Tags: e-commerce, multimodal AI, LLM, video understanding, dataset, video narration
Written by

Alimama Tech

Official Alimama tech channel, showcasing all of Alimama's technical innovations.
