HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval
This paper proposes HiT, a hierarchical transformer model for video-text retrieval. It introduces hierarchical cross-modal contrast matching and momentum cross-modal contrast to address two limitations of existing multimodal learning methods: they do not exploit the layer hierarchy of transformers, and end-to-end training restricts the number of negative samples that fit in memory.
This paper addresses the challenge of video-text retrieval as the volume of multimedia content on the internet grows. The authors propose HiT (Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval), which adopts a dual-stream transformer framework and introduces two key innovations: Hierarchical Cross-modal Contrast Matching (HCM) and Momentum Cross-modal Contrast (MCC).
HCM exploits the hierarchical nature of transformer networks: because lower and higher transformer layers capture different characteristics, it performs contrastive matching at both the feature level and the semantic level. MCC borrows the momentum update mechanism from MoCo so that the model can draw on a large queue of negative samples, which end-to-end training cannot do because of memory limits.
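The MoCo-style mechanism that MCC builds on can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy two-dimensional embeddings, the queue size, the momentum coefficient `m`, and the `temperature` are all assumptions chosen for illustration. It shows the three ingredients: a momentum (moving-average) update of the key encoder's weights, a fixed-size queue of past key embeddings used as negatives, and an InfoNCE-style contrastive loss.

```python
import math
from collections import deque


def momentum_update(key_params, query_params, m=0.999):
    """Move key-encoder weights slowly toward the query-encoder weights."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """Contrastive loss: the positive pair competes against queued negatives."""
    logits = [dot(query, positive_key) / temperature]
    logits += [dot(query, n) / temperature for n in negative_keys]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum - logits[0]


# Fixed-size FIFO queue of key embeddings; the oldest entries drop out automatically.
queue = deque(maxlen=4)
for key in ([0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]):
    queue.append(key)

text_query = [1.0, 0.0]  # toy text embedding from the query encoder (assumed)
video_key = [1.0, 0.0]   # toy matching video embedding from the key encoder (assumed)
loss = info_nce(text_query, video_key, list(queue))
queue.append(video_key)  # enqueue the current key so later batches can use it as a negative

# Only the query encoder would receive gradients; the key encoder follows by momentum.
key_weights = momentum_update([0.0, 0.0], [1.0, 1.0], m=0.9)
```

Because the queued keys are stored embeddings rather than part of the current batch, the number of negatives is decoupled from the batch size, which is the memory advantage described above. In a hierarchical setup such as HCM, a loss of this kind would presumably be computed once per matching level and per retrieval direction.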
The model architecture consists of transformer-based video and text encoders, with contrastive matching performed at two network levels using four different retrieval approaches. Experiments demonstrate that HiT achieves state-of-the-art performance on multiple video-text retrieval datasets, including MSR-VTT, ActivityNet Captions, and LSMDC. The model has been deployed in several business scenarios at Kuaishou, where it improves the representation capability of multimodal models for video retrieval, content understanding, and intelligent review applications.
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.