How MuLTI Achieves Memory‑Efficient Video‑Language Understanding with Text‑Guided MultiWay Sampling
The paper presents MuLTI, a multimodal video‑language model that tackles the memory and efficiency challenges of long video‑text sequences by introducing a Text‑Guided MultiWay Sampler and a Multiple Choice Modeling pre‑training task, achieving state‑of‑the‑art results on video QA and retrieval while drastically reducing GPU memory consumption.
Background
Multimodal understanding models are used for tasks such as multi‑label classification, video QA, and text‑video retrieval. Existing methods face two major challenges: balancing efficiency and performance on long sequences, and reducing the domain gap between pre‑training and downstream tasks.
Challenges in Feature Fusion
Typical multimodal models consist of a text encoder, a video encoder, and a feature‑fusion module. The fusion module often dominates computational cost and memory consumption, especially when long video and text features are concatenated before attention, because the attention map grows quadratically with the combined sequence length.
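To make the quadratic growth concrete, a back‑of‑the‑envelope estimate of one layer's attention‑map memory for a concatenated video‑text sequence (all sizes here are hypothetical, not values reported by the paper):

```python
def attention_map_bytes(video_len: int, text_len: int, num_heads: int = 12,
                        bytes_per_float: int = 4) -> int:
    """Memory for one layer's attention scores over the concatenated sequence."""
    seq_len = video_len + text_len  # concatenated lengths simply add
    # one (seq_len x seq_len) score matrix per head -> quadratic in seq_len
    return num_heads * seq_len * seq_len * bytes_per_float

# Doubling the video length roughly quadruples attention-map memory:
short = attention_map_bytes(video_len=1568, text_len=40)  # e.g. 8 frames x 196 patches
long = attention_map_bytes(video_len=3136, text_len=40)   # 16 frames x 196 patches
print(long / short)  # ~3.9x
```

This is why compressing the video sequence before fusion, rather than after, pays off so directly in memory.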
Proposed Model: MuLTI
MuLTI introduces a Text‑Guided MultiWay Sampler that adaptively pools sequence blocks based on importance scores, allowing efficient compression of video features guided by concise textual cues. The sampler shares attention weights with the fusion module and retains separate feed‑forward networks for each modality.
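A minimal NumPy sketch of the sampling idea: split the video token sequence into blocks and pool each block with softmax weights derived from a text‑conditioned dot product. The function name, block splitting, and use of a single condensed text vector are illustrative assumptions, not the paper's exact implementation (which shares attention weights with the fusion module).

```python
import numpy as np

def text_guided_sample(video_feats, text_query, k):
    """Compress T video tokens to k vectors via text-guided attention pooling.

    video_feats: (T, d) video token features
    text_query:  (d,)   a condensed text representation (e.g. a [CLS] vector)
    Hypothetical sketch: k contiguous blocks, softmax-pooled by relevance.
    """
    T, d = video_feats.shape
    scores = video_feats @ text_query / np.sqrt(d)   # (T,) relevance to the text
    pooled = []
    for block in np.array_split(np.arange(T), k):    # k contiguous blocks
        w = np.exp(scores[block] - scores[block].max())
        w /= w.sum()                                 # softmax within the block
        pooled.append(w @ video_feats[block])        # attention-weighted average
    return np.stack(pooled)                          # (k, d)

rng = np.random.default_rng(0)
out = text_guided_sample(rng.normal(size=(196, 64)), rng.normal(size=64), k=8)
print(out.shape)  # (8, 64)
```

The key property is that the output length k is fixed regardless of the input length T, so downstream fusion cost no longer scales with video length.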
To bridge the pre‑training‑downstream gap, MuLTI adds a Multiple Choice Modeling (MCM) pre‑training task that constructs four‑option QA pairs from large video‑text datasets, enhancing video‑question answering and retrieval performance.
Model Architecture
The base model uses a 12‑layer ViT‑B/16 video encoder and a 12‑layer BERT text encoder. Video frames are sparsely sampled, split into non‑overlapping patches, and projected to a feature width d. The text encoder outputs a sequence of length L with the same width d. A feature‑fusion module flattens and concatenates video and text features, then applies the Text‑Guided MultiWay Sampler and an Adapt‑Pooling layer to obtain compressed representations.
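A shape walkthrough for this base configuration helps make the sequence lengths concrete. The frame count, input resolution, and text length below are assumed for illustration; the patch size (16) and BERT‑base width (768) follow from ViT‑B/16 and the 12‑layer BERT encoder named above.

```python
# Hypothetical shape walkthrough for the base configuration described above.
frames, image_size, patch = 8, 224, 16          # assumed frame count and resolution
patches_per_frame = (image_size // patch) ** 2  # ViT-B/16: 14 x 14 = 196 patches
video_len = frames * patches_per_frame          # flattened video token sequence
d, text_len = 768, 40                           # BERT-base width; assumed text length L
fused_len = video_len + text_len                # concatenated length before sampling
print(patches_per_frame, video_len, fused_len)  # 196 1568 1608
```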
Pre‑training Details
MuLTI is pre‑trained on 5.5 M video‑text pairs (WebVid‑2M and CC‑3M) using four objectives: Masked Language Modeling (MLM), Video‑Text Matching (VTM), Video‑Text Contrastive learning (VTC), and the proposed MCM. Training uses AdamW (learning rate = 1e‑4, weight decay = 0.05) on 8 × NVIDIA A100 GPUs for 10 epochs.
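The reported settings can be collected into a single config fragment; batch size and learning‑rate schedule are not stated in this summary, so they are deliberately omitted rather than guessed.

```python
# Pre-training hyperparameters as reported in the summary above.
pretrain_config = {
    "data": ["WebVid-2M", "CC-3M"],              # ~5.5M video-text pairs
    "objectives": ["MLM", "VTM", "VTC", "MCM"],  # MCM is the proposed task
    "optimizer": "AdamW",
    "learning_rate": 1e-4,
    "weight_decay": 0.05,
    "epochs": 10,
    "hardware": "8 x NVIDIA A100",
}
print(len(pretrain_config["objectives"]))  # 4
```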
Downstream Evaluation
Evaluated on five video‑QA benchmarks (MSRVTT‑QA, MSVD‑QA, TGIF‑Action, TGIF‑Transition, TGIF‑FrameQA) and two text‑video retrieval datasets (MSRVTT and DiDeMo), MuLTI achieves state‑of‑the‑art results, outperforming prior methods while using substantially less GPU memory.
Ablation Studies
Experiments demonstrate the importance of the Text‑Guided MultiWay Sampler, Adapt‑Pooling, and MCM. Removing any component degrades performance, confirming their contributions to efficient fusion and reduced pre‑training‑downstream gap.
Future Work
Planned extensions include incorporating audio modalities, further reducing ViT FLOPs and memory, and exploring distilled models with comparable performance.
For reference, the MCM pre‑training task formats each question and its four candidate answers as a single text input:
"[CLS]<Question> ? [SEP] Option 1: <Answer 1>. [SEP] Option 2: <Answer 2>. [SEP] Option 3: <Answer 3>. [SEP] Option 4: <Answer 4>."
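A small helper (hypothetical; the function name and signature are ours) that assembles this template from a question and four candidate answers:

```python
def build_mcm_prompt(question: str, options: list[str]) -> str:
    """Format a question and its four candidate answers into the MCM input string."""
    assert len(options) == 4, "MCM uses exactly four options"
    parts = [f"[CLS]{question} ?"]
    parts += [f"Option {i}: {ans}." for i, ans in enumerate(options, start=1)]
    return " [SEP] ".join(parts)

prompt = build_mcm_prompt("What is the man holding",
                          ["a guitar", "a phone", "a cup", "a book"])
print(prompt)
```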
