
How VideoCLIP‑XL Boosts Long‑Description Understanding in Video CLIP Models

VideoCLIP‑XL, a new video CLIP model introduced by Alibaba Cloud AI Platform and Sun Yat‑sen University, enhances long‑text description comprehension through a large‑scale VILD dataset, a text‑similarity guided principal component matching method, and novel DDR and HDR ranking tasks, achieving superior performance on multiple video‑text benchmarks.

Alibaba Cloud Big Data AI Platform

Nov 7, 2024

Background

Contrastive Language‑Image Pre‑training (CLIP) has advanced vision‑language pre‑training, but its text encoder is limited to a maximum position‑embedding length of 77 tokens, with an effective length of about 20 tokens. This restriction hampers the ability to process long textual descriptions, causing models to overlook fine‑grained details.

Video and Long-Description Dataset (VILD)

To address the scarcity of videos paired with long descriptions, the authors built a large-scale dataset, VILD, using an automated pipeline that aggregates four kinds of sources:

Video narration data: VidLN provides separate descriptions for the individual subjects in each video. These are merged into a single holistic narrative by large language models (LLMs) and then rewritten for diversity.

Video instruction fine-tuning data: Public datasets such as VideoInstruct100K and VideoChat are filtered with LLMs to retain only description-relevant samples, which are then rewritten.

Existing video and long-description pairs: Samples are drawn from MiraData (57.8K game and city scenes) and Open-Sora (50K natural-scenery descriptions).

Raw video data: 2M video clips are sampled from Panda-70M. Three keyframes per clip are captioned by large multimodal models (LMMs), and LLMs then combine the clip's short title with the frame-level captions into one long description.

Pairs whose video-text similarity, as scored by ViCLIP and Long-CLIP, falls below 0.20 are discarded as low quality, leaving over 2M high-quality video and long-description pairs.
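The filtering step can be sketched as follows. This is only an illustration: the `Pair` record, its field names, and the choice to average the two model scores before applying the 0.20 cutoff are assumptions, since the text does not say how the ViCLIP and Long-CLIP scores are combined.

```python
from dataclasses import dataclass

SIM_THRESHOLD = 0.20  # cutoff stated in the text


@dataclass
class Pair:
    video_id: str
    description: str
    viclip_sim: float    # hypothetical precomputed ViCLIP video-text similarity
    longclip_sim: float  # hypothetical precomputed Long-CLIP similarity


def keep_pair(pair, threshold=SIM_THRESHOLD):
    # Assumption: average the two scores, then drop pairs below the threshold.
    score = (pair.viclip_sim + pair.longclip_sim) / 2
    return score >= threshold


def filter_pairs(pairs):
    """Return only the pairs that pass the similarity filter."""
    return [p for p in pairs if keep_pair(p)]


pairs = [
    Pair("v1", "A man in a red jacket walks a dog along the beach.", 0.31, 0.27),
    Pair("v2", "A recipe for tomato soup.", 0.12, 0.18),
]
kept = filter_pairs(pairs)
print([p.video_id for p in kept])
```

In a real pipeline the two similarity fields would come from running both models over every candidate pair; here they are hard-coded so the example runs on its own.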

Text‑Similarity Guided Principal Component Matching (TPCM)

Standard CLIP pre-training uses contrastive learning with an InfoNCE loss on visual-text pairs. Long-CLIP added principal component matching (PCM): a decomposition function F splits a feature into components, a filter E discards the less important ones, and a reconstruction function F⁻¹ rebuilds the feature from what remains, keeping the top 32 attributes. Video content is richer and more variable than a single image, so a fixed number of attributes is insufficient. VideoCLIP-XL instead uses the cosine similarity between the long- and short-description text features as a guidance signal, dynamically choosing how many components to keep for each sample during training.
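A minimal sketch of similarity-guided component selection is shown below. It makes several assumptions that are not spelled out in the text: the guidance signal is a single target cosine similarity (computed here between a long and a short description's text features), the video feature is already expressed in a decomposed basis where each coordinate is one component, and a component's importance is its absolute value. The function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_components(video_feat, target_sim):
    """Keep components in descending importance until the compressed feature
    is at least target_sim similar to the original feature."""
    order = sorted(range(len(video_feat)),
                   key=lambda i: abs(video_feat[i]), reverse=True)
    compressed = [0.0] * len(video_feat)
    kept = 0
    for i in order:
        compressed[i] = video_feat[i]  # restore the next most important component
        kept += 1
        if cosine(compressed, video_feat) >= target_sim:
            break
    return compressed, kept


# Toy features; real ones would come from the text and video encoders.
long_text = [1.0, 0.5, 0.2]
short_text = [1.0, 0.4, 0.0]
video = [3.0, 0.5, -2.0, 0.1, 1.0]

target = cosine(long_text, short_text)  # guidance signal
compressed, kept = select_components(video, target)
print(kept, compressed)
```

The point of the design is that the number of retained components is not fixed: a lower target similarity keeps fewer components, and a higher one keeps more.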

Description Ranking Tasks

Two new pre‑training tasks are proposed to encourage models to prefer richer, more accurate descriptions:

Detail Description Ranking (DDR): Clauses, adjectives, numbers, or parse sub-trees are randomly deleted from a long description, step by step, producing a sequence of progressively less detailed texts.

Hallucination Description Ranking (HDR): Specific words (nouns, numbers, colors, directions, verbs) are replaced, step by step, with semantically different words of the same syntactic category, producing a sequence of descriptions with progressively more hallucinated content.
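The two generation procedures can be illustrated with a simplified, parser-free sketch. Everything here is an assumption made for illustration: the `SWAPS` table and the `removable` word set stand in for the categories a syntactic parser would identify, and whole clauses or sub-trees are not handled.

```python
import random

# Hypothetical swap table: each word maps to same-category alternatives.
SWAPS = {
    "red": ["blue", "green"],
    "two": ["three", "five"],
    "left": ["right"],
    "dog": ["cat", "horse"],
}


def hdr_sequence(text, steps, words_per_step, rng):
    """Return [original, v1, v2, ...], replacing more words at each step."""
    tokens = text.split()
    candidates = [i for i, t in enumerate(tokens) if t in SWAPS]
    rng.shuffle(candidates)
    sequence = [text]
    for _ in range(steps):
        for _ in range(words_per_step):
            if not candidates:
                break  # nothing left to alter
            i = candidates.pop()
            tokens[i] = rng.choice(SWAPS[tokens[i]])
        sequence.append(" ".join(tokens))
    return sequence


def ddr_sequence(text, removable, steps, rng):
    """Return [original, v1, v2, ...], deleting one detail word per step."""
    tokens = text.split()
    sequence = [text]
    for _ in range(steps):
        positions = [i for i, t in enumerate(tokens) if t in removable]
        if not positions:
            break  # no detail words left to delete
        del tokens[rng.choice(positions)]
        sequence.append(" ".join(tokens))
    return sequence


text = "a red dog and two cats turn left"
hdr = hdr_sequence(text, steps=3, words_per_step=1, rng=random.Random(0))
ddr = ddr_sequence(text, removable={"red", "two"}, steps=2, rng=random.Random(0))
for line in hdr + ddr:
    print(line)
```

Each returned list is ordered from the original description to the most degraded one, which is the order a ranking loss would consume.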

Both tasks rely on syntactic analysis tools (NLTK, spaCy) to generate the text sequences, and both are trained with margin-based ranking losses that push the video's similarity to an earlier (more detailed or less hallucinated) description above its similarity to a later one.
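A margin-based ranking loss of this kind can be sketched as a pairwise hinge loss. This is an assumed formulation, not necessarily the exact loss used in the paper: it applies one fixed margin to every pair and averages over all pairs, and the margin value in the example is made up.

```python
def ranking_loss(sims, margin):
    """Pairwise hinge loss over similarities ordered best-first.

    sims[i] is the video-text similarity of the i-th description, where a
    lower index means more detailed (DDR) or less hallucinated (HDR).
    A pair (i, j) with i < j is penalized unless sims[i] exceeds sims[j]
    by at least `margin`.
    """
    pairs = [(i, j) for i in range(len(sims)) for j in range(i + 1, len(sims))]
    if not pairs:
        return 0.0
    total = sum(max(0.0, margin - (sims[i] - sims[j])) for i, j in pairs)
    return total / len(pairs)


good_order = [0.9, 0.7, 0.5]  # similarity falls as quality falls
bad_order = [0.5, 0.7, 0.9]   # similarity rises as quality falls

print(ranking_loss(good_order, margin=0.1))
print(ranking_loss(bad_order, margin=0.1))
```

A correctly ordered list with gaps wider than the margin incurs zero loss, while a reversed list is penalized on every pair.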

Long Video Description Ranking Benchmark (LVDR)

To evaluate long-description understanding, the LVDR benchmark is built from 2,000 video-description pairs sampled from Shot2Story. For each video, a sequence of descriptions is generated with the HDR procedure. A subset is labeled p × q, where p is the number of descriptions per video and q is the number of words altered at each step. Five subsets (4 × 1, 4 × 2, 4 × 3, 4 × 4, 4 × 5) are created, so each video has four descriptions with progressively higher hallucination levels, and the gap between consecutive descriptions widens as q grows.
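One plausible way to score a model on such a benchmark is sketched below. The metric is an assumption for illustration, not a statement of the benchmark's official definition: it assumes each video comes with a list of similarities ordered from the least to the most hallucinated description (four per video in the subsets above) and reports the percentage of adjacent pairs that the model ranks correctly.

```python
def adjacent_order_score(sims):
    """Percentage of adjacent pairs where the less hallucinated text scores higher."""
    if len(sims) < 2:
        raise ValueError("need at least two descriptions per video")
    correct = sum(1 for a, b in zip(sims, sims[1:]) if a > b)
    return 100.0 * correct / (len(sims) - 1)


def subset_score(all_sims):
    """Average the per-video score over one benchmark subset."""
    return sum(adjacent_order_score(s) for s in all_sims) / len(all_sims)


# Toy similarities for two videos, four descriptions each.
subset = [
    [0.80, 0.70, 0.60, 0.50],  # perfectly ordered
    [0.80, 0.82, 0.60, 0.50],  # first pair is ranked the wrong way round
]
print(subset_score(subset))
```

Under this scoring, a model that cannot tell hallucinated text from accurate text would get about half of the adjacent pairs right by chance.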

Experimental Results

VideoCLIP‑XL achieves state‑of‑the‑art zero‑shot and fine‑tuned performance on several text‑video retrieval datasets, outperforming comparable models. Notable results include:

Zero‑shot retrieval on standard benchmarks (tables omitted for brevity).

Fine‑tuned retrieval improvements across all evaluated datasets.

Superior zero‑shot performance on the Shot2Story long‑description benchmark.

Consistently higher scores on the LVDR benchmark, demonstrating effective long‑description ranking.

Qualitative examples further illustrate the model’s ability to retrieve accurate video‑text matches.

References

Alec Radford et al., “Learning transferable visual models from natural language supervision,” ICML 2021.

Yi Wang et al., “InternVid: A large‑scale video‑text dataset for multimodal understanding and generation,” arXiv 2023.

Beichen Zhang et al., “Long‑CLIP: Unlocking the long‑text capability of CLIP,” arXiv 2024.

Mingfei Han et al., “Shot2Story20K: A new benchmark for comprehensive understanding of multi‑shot videos,” arXiv 2023.

Paper Information

Title: VideoCLIP‑XL: Advancing Long Description Understanding for Video CLIP Models

Authors: Jia‑peng Wang, Cheng‑yu Wang, Kun‑zhe Huang, Jun Huang, Lian‑wen Jin

PDF: https://arxiv.org/abs/2410.00741
