
Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning

The paper, presented at AAAI, introduces EERCF, a coarse-to-fine visual representation learning method paired with a two-stage recall-then-rerank strategy that dramatically reduces cross-modal matching FLOPs while preserving state-of-the-art retrieval performance on multiple video benchmarks.

Kuaishou Tech

At the recent AAAI conference, the Kuaishou commercial algorithm team had their paper "Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning" accepted, and the work has already been deployed in production with notable gains.

Paper details: Authors: Kaibin Tian, Yanhua Cheng, Yi Liu, Xinglin Hou, Quan Chen, Han Li. arXiv link. Code: GitHub repository.

Research background: The rapid growth of short-video platforms (Kuaishou, TikTok, YouTube) and AIGC video content makes text-to-video retrieval a core multimodal task. While CLIP-based models such as CLIP4Clip have transferred image-text pre-training to video, they rely on simple mean-pooling of frames, which discards fine-grained temporal information and leaves the heavy redundancy of video unexploited.

Existing high‑performing methods either add heavy fusion modules or use fine‑grained frame/patch representations, which improve accuracy but dramatically increase online matching cost.

Key contributions:

Introduce a parameter‑free Text‑Gate Interaction Block (TIB) that learns fine‑grained video representations guided by text, combined with an inter‑feature contrast loss and an intra‑feature Pearson constraint to improve cross‑modal alignment.

Propose a two‑stage retrieval pipeline (recall‑then‑rerank) that first uses coarse video embeddings for fast top‑k recall and then refines the ranking with the fine‑grained TIB embeddings, achieving a near 50× reduction in FLOPs while keeping performance comparable to SOTA.
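The text-gated aggregation at the heart of TIB can be sketched as follows. This is a minimal NumPy illustration with assumed shapes and a hypothetical function name, not the paper's implementation:

```python
import numpy as np

def text_gate_aggregate(tokens: np.ndarray, text: np.ndarray, tau: float = 0.1) -> np.ndarray:
    """Parameter-free text-gated pooling (illustrative sketch).

    tokens: (N, D) L2-normalized frame- or patch-level visual tokens
    text:   (D,)   L2-normalized text embedding
    tau:    temperature controlling how sharply the weights focus
    """
    sims = tokens @ text                    # (N,) cosine similarity per token
    weights = np.exp(sims / tau)
    weights /= weights.sum()                # softmax over the N tokens
    video = weights @ tokens                # (D,) similarity-weighted sum
    return video / np.linalg.norm(video)    # re-normalize for retrieval
```

Because the block has no learnable parameters, the same frame and patch features can be re-weighted per query at search time without any extra model weights.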

Representation learning details: TIB takes N visual tokens (frame‑level or patch‑level) and a text embedding, computes similarity‑based weights, and aggregates them into a video embedding; a temperature hyper‑parameter controls the sharpness of the weighting. The inter‑feature contrast loss uses InfoNCE on video‑text pairs, while the intra‑feature Pearson constraint enforces channel‑wise correlation between matching video and text channels, reducing cross‑modal noise and stabilizing training.
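The two training objectives described above might be written as follows; the exact formulation in the paper may differ, so treat this as a common-form sketch (symmetric InfoNCE plus a channel-wise Pearson term):

```python
import numpy as np

def info_nce(v: np.ndarray, t: np.ndarray, tau: float = 0.05) -> float:
    """Symmetric InfoNCE over a batch of matched pairs (illustrative).

    v, t: (B, D) L2-normalized embeddings; row i of v matches row i of t.
    """
    logits = v @ t.T / tau                                   # (B, B)
    idx = np.arange(len(v))
    log_sm_v = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_sm_t = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    return float(-0.5 * (log_sm_v[idx, idx] + log_sm_t[idx, idx]).mean())

def pearson_constraint(v: np.ndarray, t: np.ndarray) -> float:
    """1 minus the mean channel-wise Pearson correlation between
    matched video and text embeddings (computed across the batch)."""
    vc = v - v.mean(axis=0)
    tc = t - t.mean(axis=0)
    corr = (vc * tc).sum(axis=0) / (
        np.linalg.norm(vc, axis=0) * np.linalg.norm(tc, axis=0) + 1e-8)
    return float(1.0 - corr.mean())
```

The Pearson term is computed per channel across the batch, pushing matched video and text embeddings to co-vary channel by channel rather than only agreeing in overall direction.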

Two‑stage strategy: Three levels of video representation are generated: a coarse video‑level embedding (not text‑driven) for rapid recall, and the TIB‑produced frame‑level and patch‑level embeddings for reranking the retrieved candidates.
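Putting the representation levels together, the recall-then-rerank flow might look like this sketch (NumPy, hypothetical names; for brevity the rerank stage uses a single fine-grained level rather than both):

```python
import numpy as np

def recall_then_rerank(text_emb, coarse_video, fine_videos, k=100, tau=0.1):
    """Two-stage retrieval sketch.

    text_emb:     (D,)      normalized query embedding
    coarse_video: (M, D)    one coarse, text-independent embedding per video
    fine_videos:  (M, N, D) N normalized visual tokens per video
    Returns candidate video indices, best match first.
    """
    # Stage 1: cheap dot-product recall over the full corpus.
    coarse_scores = coarse_video @ text_emb            # (M,)
    topk = np.argsort(-coarse_scores)[:k]

    # Stage 2: text-gated aggregation, computed only for the k candidates.
    fine_scores = []
    for idx in topk:
        tokens = fine_videos[idx]                      # (N, D)
        w = np.exp(tokens @ text_emb / tau)
        w /= w.sum()                                   # softmax weights
        agg = w @ tokens
        agg /= np.linalg.norm(agg)
        fine_scores.append(agg @ text_emb)
    return topk[np.argsort(-np.array(fine_scores))]
```

Stage 1 costs one dot product per video over the whole corpus; the per-token aggregation is paid only for the k recalled candidates, which is where the reported FLOP savings come from.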

Experimental results: On four benchmarks (MSRVTT‑1K‑Test, MSRVTT‑3K‑Test, VATEX, ActivityNet) EERCF matches SOTA accuracy while requiring only 1/14, 1/39, 1/20, and 1/126 of the FLOPs respectively. Tables and figures in the original article illustrate the FLOP savings and the progressive improvement of ranking quality when moving from coarse to fine representations.


Tags: Efficiency, AI, multimodal learning, coarse-to-fine representation, text-to-video retrieval
Written by Kuaishou Tech

Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.