Apr 17, 2024 · Artificial Intelligence

Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning

The paper presented at AAAI introduces the EERCF method, a coarse‑to‑fine visual representation and two‑stage recall‑then‑rerank strategy that dramatically reduces cross‑modal matching FLOPs while preserving state‑of‑the‑art retrieval performance on multiple video benchmarks.

AIEfficiencyMultimodal Learning

0 likes · 8 min read

Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning

text-to-video retrieval

Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning