Unlocking Long-Text Video Understanding and LLM Distillation with Alibaba PAI

Alibaba Cloud’s AI platform PAI recently had two papers accepted at EMNLP 2024. VideoCLIP‑XL improves video‑text representation for long descriptions by combining a large dataset of videos paired with long descriptions and new pre‑training tasks. TAPIR is a curriculum‑planning framework for distilling the instruction‑following abilities of large language models. The team has also released the associated models, datasets, and integration tools.

Alibaba Cloud Big Data AI Platform

Several papers from Alibaba Cloud’s AI platform PAI, co‑developed with South China University of Technology and Fudan University, were recently accepted at EMNLP 2024, a top‑tier natural language processing conference, reflecting the research community’s recognition of the platform’s work.

VideoCLIP‑XL: Long‑Text Video Representation and Retrieval

The original CLIP model handles long textual descriptions poorly; its text encoder accepts at most 77 tokens. To address this, the authors built VILD, a large dataset pairing videos with long descriptions, and introduced a text‑similarity‑guided principal component matching method (TPCM) for pre‑training. Two new pre‑training tasks, Detail Description Ranking (DDR) and Hallucination Description Ranking (HDR), train the model to score richer, more accurate descriptions higher and to penalize hallucinated ones. The authors also propose a new benchmark, LVDR, which evaluates how well video CLIP models rank long descriptions.
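The DDR and HDR objectives can be illustrated with a pairwise ranking loss. The sketch below is not the paper's exact formulation: it assumes a list of description embeddings ordered from best to worst (most detailed for DDR, least hallucinated for HDR), scores each against a video embedding with cosine similarity, and applies a hinge penalty whenever a better description fails to beat a worse one by a hypothetical `margin`. `ranking_accuracy` is a simple pairwise check in the spirit of LVDR; the benchmark's actual metric may differ.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def pairwise_scores(video: Sequence[float], texts: Sequence[Sequence[float]]) -> List[float]:
    # Similarity of the video embedding to each candidate description.
    return [cosine(video, t) for t in texts]


def ranking_loss(video, texts, margin: float = 0.05) -> float:
    """Hinge loss asking texts[i] to outscore texts[j] by `margin` for all i < j.

    `texts` is assumed to be ordered best-first: most detailed (DDR-style) or
    least hallucinated (HDR-style). `margin` is a hypothetical hyperparameter.
    """
    s = pairwise_scores(video, texts)
    terms = [max(0.0, margin - (s[i] - s[j]))
             for i in range(len(s)) for j in range(i + 1, len(s))]
    return sum(terms) / len(terms) if terms else 0.0


def ranking_accuracy(video, texts) -> float:
    """Share of pairs (i < j) in which texts[i] scores strictly higher."""
    s = pairwise_scores(video, texts)
    pairs = [(i, j) for i in range(len(s)) for j in range(i + 1, len(s))]
    if not pairs:
        return 0.0
    return sum(s[i] > s[j] for i, j in pairs) / len(pairs)


video = [1.0, 0.0]
texts = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]  # toy embeddings, ordered best-first
print(ranking_loss(video, texts), ranking_accuracy(video, texts))
print(ranking_loss(video, texts[::-1]), ranking_accuracy(video, texts[::-1]))
```

A real training setup would compute this on batched tensors produced by the video and text encoders; plain lists keep the example dependency-free.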

TAPIR: Task‑Aware Curriculum Planning for LLM Distillation

The TAPIR framework distills instruction‑following abilities from strong black‑box teacher LLMs (e.g., GPT‑4, Qwen‑max) by explicitly modeling task diversity and difficulty. It resamples instructions that the teacher identifies as hard, rebalances the ratios of different tasks, and introduces a Model‑Fitting‑Difficulty (MFD) metric to increase the proportion of challenging mathematical‑reasoning and code tasks. In experiments, the distilled LLaMA‑2‑7B model achieves a relative score of 7.8 on AlpacaEval, surpassing larger open‑source models, and TAPIR improves models in the Qwen‑1.5 series by 3–8 points.
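TAPIR's difficulty‑aware resampling can be sketched in a few lines. Everything here is an assumption for illustration: each instruction carries a task label plus judge scores for the teacher's and the student's responses, MFD is taken to be the gap between those two scores, and `Example`, `next_round`, `threshold`, and `task_weights` are hypothetical names rather than the paper's code. The paper's exact MFD definition and sampling schedule may differ.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Example:
    instruction: str
    task: str              # task category, e.g. "math", "code", "writing"
    teacher_score: float   # judge score for the teacher's response
    student_score: float   # judge score for the student's response

    @property
    def mfd(self) -> float:
        # Hypothetical MFD: how far the student lags the teacher on this item.
        return self.teacher_score - self.student_score


def next_round(pool: List[Example], threshold: float = 2.0,
               task_weights: Optional[Dict[str, float]] = None,
               size: int = 100, seed: int = 0) -> List[Example]:
    """Resample hard examples for the next curriculum round.

    Examples with mfd >= threshold (threshold should be > 0) count as hard and
    are drawn with probability proportional to mfd * task weight, so giving
    "math" and "code" larger (positive) weights raises their share of the round.
    """
    weights_by_task = task_weights or {}
    hard = [ex for ex in pool if ex.mfd >= threshold]
    if not hard:
        return []
    weights = [ex.mfd * weights_by_task.get(ex.task, 1.0) for ex in hard]
    rng = random.Random(seed)  # seeded so a round is reproducible
    return rng.choices(hard, weights=weights, k=size)


pool = [
    Example("Prove that sqrt(2) is irrational", "math", teacher_score=9, student_score=3),
    Example("Write a haiku about rain", "writing", teacher_score=9, student_score=8.5),
    Example("Implement binary search", "code", teacher_score=8, student_score=5),
]
batch = next_round(pool, task_weights={"math": 2.0, "code": 2.0}, size=10)
print(sorted({ex.task for ex in batch}))
```

Repeating this, with the student retrained and rescored between rounds, yields a curriculum whose training mix drifts toward whatever the student still finds hard.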

Productization and Resources

The research has been integrated into PAI modules. VideoCLIP‑XL is offered as a video‑text quality assessment component that works with the EasyAnimate video generation solution, and a notebook is available for extracting cross‑modal features from ultra‑long texts. Distilled models based on Qwen‑2 (the DistilQwen2 series) have also been released, alongside easy‑to‑use LLM distillation services.
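As a rough picture of how a video‑text quality check might sit downstream of a generator such as EasyAnimate, the sketch below assumes prompt and video embeddings have already been extracted (for example with VideoCLIP‑XL) and simply ranks and filters clips by cosine similarity. The function name, clip IDs, and the 0.3 threshold are hypothetical; a usable threshold would need calibrating on real scores, and this is not the PAI component's interface.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def filter_generated_videos(prompt_emb: Dict[str, Sequence[float]],
                            video_emb: Dict[str, Sequence[float]],
                            threshold: float = 0.3) -> List[Tuple[str, float]]:
    """Keep clips whose video-text similarity is at least `threshold`.

    Both dicts are keyed by the same (hypothetical) clip IDs; the result is
    sorted from best to worst match.
    """
    scored = [(cid, cosine(prompt_emb[cid], vec)) for cid, vec in video_emb.items()]
    kept = [(cid, s) for cid, s in scored if s >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


prompts = {"clip1": [1.0, 0.0], "clip2": [0.0, 1.0]}  # toy text embeddings
videos = {"clip1": [1.0, 0.1], "clip2": [1.0, 0.0]}   # toy video embeddings
print(filter_generated_videos(prompts, videos))
```

Ranking rather than only thresholding makes it easy to keep the top‑k clips per prompt when a fixed cutoff is hard to choose.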

EasyAnimate: https://github.com/aigc-apps/EasyAnimate

VideoCLIP‑XL: https://huggingface.co/alibaba-pai/VideoCLIP-XL

VideoCLIP‑XL‑v2: https://huggingface.co/alibaba-pai/VideoCLIP-XL-v2

LVDR dataset: https://huggingface.co/alibaba-pai/LVDR

VILD dataset: https://huggingface.co/alibaba-pai/VILD

VideoCLIP‑XL demo: https://gallery.pai-ml.com/#/preview/deepLearning/cv/videoclipxl

DistilQwen2‑7B‑Instruct: https://huggingface.co/alibaba-pai/DistilQwen2-7B-Instruct

DistilQwen2‑1.5B‑Instruct: https://huggingface.co/alibaba-pai/DistilQwen2-1.5B-Instruct

Paper Details

VideoCLIP‑XL: Advancing Long Description Understanding for Video CLIP Models
Authors: Wang Jiapeng, Wang Chengyu, Huang Kunzhe, Huang Jun, Jin Lianwen
arXiv: https://arxiv.org/abs/2410.00741

Distilling Instruction‑following Abilities of Large Language Models with Task‑aware Curriculum Planning
Authors: Yue Yuanhao, Wang Chengyu, Huang Jun, Wang Peng
arXiv: https://arxiv.org/abs/2405.13448
