Alibaba Cloud Big Data AI Platform
Alibaba Cloud Big Data AI Platform
Mar 18, 2024 · Artificial Intelligence

How MuLTI Achieves Memory‑Efficient Video‑Language Understanding with Text‑Guided MultiWay Sampling

The paper presents MuLTI, a multimodal video‑language model that tackles the memory and efficiency challenges of long video‑text sequences by introducing a Text‑Guided MultiWay Sampler and a Multiple Choice Modeling pre‑training task, achieving state‑of‑the‑art results on video QA and retrieval while drastically reducing GPU memory consumption.

efficient-aifeature fusionmultimodal
0 likes · 19 min read
How MuLTI Achieves Memory‑Efficient Video‑Language Understanding with Text‑Guided MultiWay Sampling