How AI Generates Synchronized Video Narrations for E‑Commerce
This article presents the research behind Synchronized Video Storytelling, introducing the E‑SyncVidStory dataset, the VideoNarrator multimodal architecture, and extensive experiments that demonstrate high‑quality, product‑aware video narration generation for e‑commerce applications.
Background
In the short‑video era, manual editing is time‑consuming and relies on expert experience. Alibaba’s Intelligent Creation team explored automated video editing by integrating video understanding, structure analysis, and narration generation to produce coherent, product‑focused videos that improve viewer engagement.
Task Definition
Synchronized Video Storytelling requires generating a product‑related narration given product information, video footage, and a predefined storyline. The narration must be rich, coherent, and aligned with visual content.
Dataset (E‑SyncVidStory)
Data Collection
High‑click‑rate advertising videos were selected. For each video, product metadata (name, ingredients, usage, etc.) was extracted, videos were segmented into event‑based clips, and automatic speech recognition (ASR) provided raw subtitles. GPT‑4 corrected ASR errors and assigned script tags, followed by crowdsourced verification.
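The GPT‑4 correction step can be pictured as a prompt builder that hands the model the raw subtitle together with the product metadata. The function below is a hypothetical sketch, not the paper's actual schema; the field names, tag set, and output format are all assumptions:

```python
def build_correction_prompt(raw_subtitle: str, product: dict, tag_set: list) -> str:
    """Assemble a GPT-4 prompt that corrects ASR errors in a raw subtitle
    and assigns one script tag. Field names and wording are illustrative,
    not the paper's exact annotation schema."""
    product_block = "\n".join(f"- {k}: {v}" for k, v in product.items())
    return (
        "You are cleaning ASR transcripts of e-commerce ad videos.\n"
        f"Product information:\n{product_block}\n"
        f"Raw subtitle: {raw_subtitle}\n"
        "1) Correct recognition errors using the product information.\n"
        f"2) Assign exactly one script tag from: {', '.join(tag_set)}.\n"
        'Return JSON: {"corrected": ..., "tag": ...}'
    )

# Example call with invented product metadata:
prompt = build_correction_prompt(
    "this serum hide rates your skin",
    {"name": "HydraGlow Serum", "ingredients": "hyaluronic acid"},
    ["product intro", "feature", "call to action"],
)
```

The corrected output would then go to crowdsourced verification, as described above.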
Statistics and Analysis
The dataset contains 6,032 videos and 41,292 clips covering categories such as personal care, fashion, home goods, maternity, and electronics. Clip lengths range from 2 s to 220 s (average 39 s) with an average transcript length of 194 words. A large variance in clip count per video motivated the use of relative‑position embeddings.
Methodology
Pre‑processing
Product information is parsed to extract key attributes. Video streams are analyzed with event detection and transition detection to produce coherent clip sequences.
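As a rough illustration of transition detection (the paper does not specify its detectors here), cuts can be flagged wherever consecutive frame histograms differ sharply; the histogram representation and threshold below are assumptions:

```python
def detect_cuts(histograms, threshold=0.5):
    """Flag a cut wherever the L1 distance between consecutive
    (normalized) frame histograms exceeds the threshold."""
    cuts = []
    for i in range(1, len(histograms)):
        diff = sum(abs(a - b) for a, b in zip(histograms[i - 1], histograms[i])) / 2
        if diff > threshold:
            cuts.append(i)
    return cuts

def to_clips(n_frames, cuts):
    """Turn cut indices into (start, end) frame ranges, one per clip."""
    bounds = [0] + cuts + [n_frames]
    return list(zip(bounds[:-1], bounds[1:]))

# Two steady shots separated by an abrupt change at frame 3:
hists = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 2
print(to_clips(len(hists), detect_cuts(hists)))  # [(0, 3), (3, 5)]
```

A production system would likely combine this with learned event detection, but the clip-boundary bookkeeping is the same.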
VideoNarrator Architecture
The system consists of four main modules:
Video Input: Frames from each clip are converted to image embeddings and concatenated across clips.
Feature Extraction: Visual tokens are projected together with relative-position embeddings via a Video Projector, producing a unified video embedding.
Memory Integration: For long videos, earlier visual information is compressed into fixed-length memory tokens to keep context manageable.
LLM Prompting: A large language model receives visual embeddings, product knowledge, and script tags to generate the narration, guided by prompts that restrict the model to using only provided information.
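The textual side of the prompting module can be sketched as below. The `<vis>` placeholder stands in for the projected video embeddings, and the exact wording of the anti-hallucination instruction is an assumption, not the paper's prompt:

```python
def assemble_prompt(product_facts, script_tag, n_visual_tokens):
    """Compose the text prompt around placeholder visual tokens.
    The instruction restricts the model to provided facts, mirroring
    the paper's prompting strategy; exact wording is an assumption."""
    visual_slot = "<vis>" * n_visual_tokens  # stands in for video embeddings
    facts = "\n".join(f"- {f}" for f in product_facts)
    return (
        f"{visual_slot}\n"
        f"Product knowledge:\n{facts}\n"
        f"Script tag for this clip: {script_tag}\n"
        "Write one narration sentence for this clip. Use ONLY the product "
        "knowledge above; do not introduce unstated claims."
    )

p = assemble_prompt(["lightweight formula", "suitable for daily use"],
                    "feature", n_visual_tokens=4)
```

In the real system the visual slots are filled with continuous embeddings rather than text tokens; only the surrounding instruction is plain text.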
Detailed Design
Visual Encoder: CLIP‑L/14 encodes selected frames.
Visual Projection: BLIP‑2’s Q‑Former extracts aligned multimodal features.
Video Token Concatenation: Previous clip embeddings are prepended to the current clip to preserve temporal context.
Video Compression: Adjacent frames with similarity above a threshold τ are removed iteratively.
Position Embedding: Each clip receives a relative position p% (its location in the video), which is embedded and added to the visual token.
LLM Prompting: Prompts emphasize using product‑specific knowledge and discourage hallucinations.
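The compression and position-embedding steps above can be sketched in a few lines. The default τ and the choice to compare each frame against the last one kept are assumptions; the paper embeds the relative position rather than using the raw scalar:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def compress(frame_embs, tau=0.95):
    """Iteratively drop a frame when it is nearly identical
    (cosine similarity > tau) to the last frame kept."""
    kept = [frame_embs[0]]
    for emb in frame_embs[1:]:
        if cosine(kept[-1], emb) <= tau:
            kept.append(emb)
    return kept

def relative_position(clip_index, n_clips):
    """The p% scalar locating a clip within the video; the paper
    embeds this value and adds it to the clip's visual tokens."""
    return clip_index / n_clips

embs = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]]
print(len(compress(embs)))  # 2: the near-duplicate second frame is dropped
```

Because compression keys on embedding similarity rather than timestamps, static shots shrink to a few tokens while fast-moving footage keeps most of its frames.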
Experiments
Comparison with Baselines
VideoNarrator was compared against pipeline baselines that pair a multimodal model with an LLM (LLaVA + GPT‑3.5) and against end‑to‑end MLLMs (Video‑ChatGPT, Video‑LLaVA, VTimeLLM), in both zero‑shot/few‑shot and fine‑tuned settings. Quantitative results show superior performance across automatic metrics.
Human Evaluation
Three criteria—visual relevance, attractiveness, and coherence—were assessed by human judges. VideoNarrator consistently outperformed GPT‑4V + GPT‑4 on all metrics.
General‑Domain Evaluation
To verify cross‑domain applicability, the model was tested on a generic video storytelling dataset after minor adjustments. It achieved state‑of‑the‑art results, confirming its versatility beyond e‑commerce.
Qualitative Results
Sample outputs demonstrate that the generated narration aligns tightly with visual content and follows the product‑centric storyline, matching the style of the E‑SyncVidStory annotations.
Conclusion and Future Work
The proposed VideoNarrator enables automated, high‑quality video narration for e‑commerce, facilitating faster video production and richer ad experiences. Remaining challenges include mitigating hallucinations and improving instruction following. Future work will explore stronger constraints and more sophisticated video understanding modules to further enhance narration quality.
Alimama Tech
Official Alimama tech channel, showcasing all of Alimama's technical innovations.