How MuLTI Achieves Memory‑Efficient Video‑Language Understanding with Text‑Guided MultiWay Sampling
The paper presents MuLTI, a multimodal video‑language model that tackles the memory and efficiency challenges of long video‑text sequences by introducing a Text‑Guided MultiWay Sampler and a Multiple Choice Modeling pre‑training task, achieving state‑of‑the‑art results on video QA and retrieval while drastically reducing GPU memory consumption.
Background
Multimodal understanding models are used for tasks such as multi‑label classification, video QA, and text‑video retrieval. Existing methods face two major challenges: balancing efficiency and performance on long sequences, and reducing the domain gap between pre‑training and downstream tasks.
Challenges in Feature Fusion
Typical multimodal models consist of a text encoder, a video encoder, and a feature‑fusion module. The fusion module often dominates computational cost and memory consumption, especially when long video and text features are concatenated before attention, because the attention map grows quadratically with the combined sequence length.
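To make the quadratic growth concrete, a back‑of‑the‑envelope estimate of one layer's attention‑map memory for a concatenated video‑text sequence (all sizes here are hypothetical, not values reported by the paper):

```python
def attention_map_bytes(video_len: int, text_len: int, num_heads: int = 12,
                        bytes_per_float: int = 4) -> int:
    """Memory for one layer's attention scores over the concatenated sequence."""
    seq_len = video_len + text_len  # concatenated lengths simply add
    # one (seq_len x seq_len) score matrix per head -> quadratic in seq_len
    return num_heads * seq_len * seq_len * bytes_per_float

# Doubling the video length roughly quadruples attention-map memory:
short = attention_map_bytes(video_len=1568, text_len=40)  # e.g. 8 frames x 196 patches
long = attention_map_bytes(video_len=3136, text_len=40)   # 16 frames x 196 patches
print(long / short)  # ~3.9x
```

This is why compressing the video sequence before fusion, rather than after, pays off so directly in memory.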
Proposed Model: MuLTI
MuLTI introduces a Text‑Guided MultiWay Sampler that adaptively pools sequence blocks based on importance scores, allowing efficient compression of video features guided by concise textual cues. The sampler shares attention weights with the fusion module and retains separate feed‑forward networks for each modality.
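A minimal NumPy sketch of the sampling idea: split the video token sequence into blocks and pool each block with softmax weights derived from a text‑conditioned dot product. The function name, block splitting, and use of a single condensed text vector are illustrative assumptions, not the paper's exact implementation (which shares attention weights with the fusion module).

```python
import numpy as np

def text_guided_sample(video_feats, text_query, k):
    """Compress T video tokens to k vectors via text-guided attention pooling.

    video_feats: (T, d) video token features
    text_query:  (d,)   a condensed text representation (e.g. a [CLS] vector)
    Hypothetical sketch: k contiguous blocks, softmax-pooled by relevance.
    """
    T, d = video_feats.shape
    scores = video_feats @ text_query / np.sqrt(d)   # (T,) relevance to the text
    pooled = []
    for block in np.array_split(np.arange(T), k):    # k contiguous blocks
        w = np.exp(scores[block] - scores[block].max())
        w /= w.sum()                                 # softmax within the block
        pooled.append(w @ video_feats[block])        # attention-weighted average
    return np.stack(pooled)                          # (k, d)

rng = np.random.default_rng(0)
out = text_guided_sample(rng.normal(size=(196, 64)), rng.normal(size=64), k=8)
print(out.shape)  # (8, 64)
```

The key property is that the output length k is fixed regardless of the input length T, so downstream fusion cost no longer scales with video length.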
To bridge the pre‑training‑downstream gap, MuLTI adds a Multiple Choice Modeling (MCM) pre‑training task that constructs four‑option QA pairs from large video‑text datasets, enhancing video‑question answering and retrieval performance.
Model Architecture
The base model uses a 12‑layer ViT‑B/16 video encoder and a 12‑layer BERT text encoder. Video frames are sparsely sampled, split into non‑overlapping patches, and projected to a feature width d. The text encoder outputs a sequence of length L with the same width d. A feature‑fusion module flattens and concatenates video and text features, then applies the Text‑Guided MultiWay Sampler and an Adapt‑Pooling layer to obtain compressed representations.
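A shape walkthrough for this base configuration helps make the sequence lengths concrete. The frame count, input resolution, and text length below are assumed for illustration; the patch size (16) and BERT‑base width (768) follow from ViT‑B/16 and the 12‑layer BERT encoder named above.

```python
# Hypothetical shape walkthrough for the base configuration described above.
frames, image_size, patch = 8, 224, 16          # assumed frame count and resolution
patches_per_frame = (image_size // patch) ** 2  # ViT-B/16: 14 x 14 = 196 patches
video_len = frames * patches_per_frame          # flattened video token sequence
d, text_len = 768, 40                           # BERT-base width; assumed text length L
fused_len = video_len + text_len                # concatenated length before sampling
print(patches_per_frame, video_len, fused_len)  # 196 1568 1608
```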
Pre‑training Details
MuLTI is pre‑trained on 5.5 M video‑text pairs (WebVid‑2M and CC‑3M) using four objectives: Masked Language Modeling (MLM), Video‑Text Matching (VTM), Video‑Text Contrastive learning (VTC), and the proposed MCM. Training uses AdamW (learning rate = 1e‑4, weight decay = 0.05) on 8 × NVIDIA A100 GPUs for 10 epochs.
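The reported settings can be collected into a single config fragment; batch size and learning‑rate schedule are not stated in this summary, so they are deliberately omitted rather than guessed.

```python
# Pre-training hyperparameters as reported in the summary above.
pretrain_config = {
    "data": ["WebVid-2M", "CC-3M"],              # ~5.5M video-text pairs
    "objectives": ["MLM", "VTM", "VTC", "MCM"],  # MCM is the proposed task
    "optimizer": "AdamW",
    "learning_rate": 1e-4,
    "weight_decay": 0.05,
    "epochs": 10,
    "hardware": "8 x NVIDIA A100",
}
print(len(pretrain_config["objectives"]))  # 4
```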
Downstream Evaluation
Evaluated on five video‑QA benchmarks (MSRVTT‑QA, MSVD‑QA, TGIF‑Action, TGIF‑Transition, TGIF‑FrameQA) and two text‑video retrieval datasets (MSRVTT and DiDeMo), MuLTI achieves state‑of‑the‑art results, outperforming prior methods while using substantially less GPU memory.
Ablation Studies
Experiments demonstrate the importance of the Text‑Guided MultiWay Sampler, Adapt‑Pooling, and MCM. Removing any component degrades performance, confirming their contributions to efficient fusion and reduced pre‑training‑downstream gap.
Future Work
Planned extensions include incorporating audio modalities, further reducing ViT FLOPs and memory, and exploring distilled models with comparable performance.
For reference, the MCM pre‑training task formats each question and its four candidate answers as a single text input:
"[CLS]<Question> ? [SEP] Option 1: <Answer 1>. [SEP] Option 2: <Answer 2>. [SEP] Option 3: <Answer 3>. [SEP] Option 4: <Answer 4>."
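A small helper (hypothetical; the function name and signature are ours) that assembles this template from a question and four candidate answers:

```python
def build_mcm_prompt(question: str, options: list[str]) -> str:
    """Format a question and its four candidate answers into the MCM input string."""
    assert len(options) == 4, "MCM uses exactly four options"
    parts = [f"[CLS]{question} ?"]
    parts += [f"Option {i}: {ans}." for i, ans in enumerate(options, start=1)]
    return " [SEP] ".join(parts)

prompt = build_mcm_prompt("What is the man holding",
                          ["a guitar", "a phone", "a cup", "a book"])
print(prompt)
```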
