Glance Supervised Video Moment Retrieval via the ViGA Framework
The paper presents a glance‑supervised video moment retrieval approach that records a single annotator‑seen frame, introduces the ViGA contrastive learning framework to leverage this weak temporal cue, and demonstrates on three benchmarks performance rivaling fully supervised methods while keeping annotation cost minimal.
Author: Cui Ran, AI Platform Department, algorithm intern, focusing on the latest video content understanding algorithms for Bilibili.
Introduction: Bilibili's AI Platform not only provides AI support for the main site but also conducts frontier research on video algorithms. Their work titled "Video Moment Retrieval from Text Queries via Single Frame Annotation" was accepted at SIGIR 2022. The paper proposes a new paradigm for video moment retrieval based on frame‑level annotations and introduces an effective contrastive learning framework.
All annotations, code, and pretrained models are open‑source (paper: https://arxiv.org/abs/2204.09409, code: https://github.com/r-cui/ViGA).
1. Video Moment Retrieval (VMR) Task Overview
VMR aims to locate a video segment that semantically matches a natural‑language query. It differs from Video Action Localization (VAL) by using free‑form language instead of predefined action categories, making it a more challenging task. Early works (2017) used fully supervised data with precise temporal boundaries, while weakly supervised VMR (2019) only required video‑query pairs without explicit timestamps.
2. Glance Supervised VMR
Observing that annotators must watch the video to write a query, the authors propose to record a single arbitrary frame (“glance”) that the annotator has seen. Each training sample therefore contains the video, the query, and one timestamp of a glanced frame. This adds negligible annotation cost while providing additional supervision compared to weakly supervised VMR.
The approach reduces labeling effort relative to full supervision and supplies useful temporal cues.
3. ViGA: A Contrastive Learning Framework
Training
Because precise start‑end times are unavailable, ViGA adopts a contrastive learning strategy similar to weak supervision. The video is split into clips via a sliding window. A Gaussian prior centered at the glanced timestamp assigns higher weights to clips near the glance. The model learns joint vision‑language embeddings using an InfoNCE loss and an additional attention loss that encourages the encoder to focus on the glanced region.
Inference
During inference, the trained model’s attention map identifies high‑attention regions, which serve as anchors for generating proposals of varying lengths. The proposal with the highest similarity to the query embedding is selected as the final output.
4. Experimental Results
Experiments on three standard VMR datasets (Charades‑STA, ActivityNet Captions, TACoS) show that ViGA outperforms state‑of‑the‑art weakly supervised methods and is comparable to several fully supervised approaches. The authors note that ViGA is an initial exploration and encourage further improvements on the glance‑supervised setting.
5. Visual Examples
Illustrative cases include two successful retrievals and one failure, demonstrating the strengths and limitations of the proposed method.
References
[1] Lisa Anne Hendricks et al., "Localizing moments in video with natural language," CVPR 2017.
[2] Jiyang Gao et al., "TALL: Temporal activity localization via language query," CVPR 2017.
[3] Niluthpol Chowdhury Mithun et al., "Weakly supervised video moment retrieval from text queries," CVPR 2019.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
