Glance Supervised Video Moment Retrieval via the ViGA Framework

The paper proposes glance supervision for video moment retrieval: alongside each query, the annotation records a single frame that the annotator saw. It introduces ViGA, a contrastive learning framework that exploits this weak temporal cue, and shows on three benchmarks that performance rivals fully supervised methods while annotation cost stays minimal.

Bilibili Tech

Author: Cui Ran, AI Platform Department, algorithm intern, focusing on the latest video content understanding algorithms for Bilibili.

Introduction: Bilibili's AI Platform not only provides AI support for the main site but also conducts frontier research on video algorithms. The team's paper, "Video Moment Retrieval from Text Queries via Single Frame Annotation", was accepted at SIGIR 2022. It proposes a new paradigm for video moment retrieval based on single-frame annotation and introduces an effective contrastive learning framework.

All annotations, code, and pretrained models are open‑source (paper: https://arxiv.org/abs/2204.09409, code: https://github.com/r-cui/ViGA).

1. Video Moment Retrieval (VMR) Task Overview

VMR aims to locate the video segment that semantically matches a natural‑language query. It differs from Video Action Localization (VAL) in that queries are free‑form language rather than predefined action categories, which makes the task more challenging. Early works from 2017 [1, 2] were fully supervised and trained on precise temporal boundaries, while weakly supervised VMR, introduced in 2019 [3], requires only video‑query pairs without explicit timestamps.

2. Glance Supervised VMR

Because annotators must watch the video to write a query, the authors propose also recording a single frame (a “glance”) that the annotator saw within the moment being described. Each training sample therefore contains the video, the query, and the timestamp of one glanced frame. This adds negligible annotation cost over weakly supervised VMR while supplying a temporal cue, and it still requires far less labeling effort than marking exact start and end times for full supervision.
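To make the shape of a glance‑supervised training sample concrete, here is a minimal sketch of one possible record type. The class name `GlanceSample` and its field names are hypothetical and chosen for illustration; they are not taken from the released annotations or code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlanceSample:
    """One glance-supervised training sample (hypothetical layout)."""

    video_id: str     # identifier of the untrimmed video
    duration: float   # video length in seconds
    query: str        # free-form natural-language description
    glance: float     # timestamp (seconds) of the single glanced frame

    def __post_init__(self):
        # The glance is a point in time inside the video, not a span.
        if not 0.0 <= self.glance <= self.duration:
            raise ValueError("glance timestamp must fall inside the video")


sample = GlanceSample("video_0001", 30.0, "a person opens the door", 12.5)
```

Compared with a fully supervised sample, the only temporal label is one number instead of a start and end pair; compared with a weakly supervised sample, it is one number more than nothing.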

3. ViGA: A Contrastive Learning Framework

Training

Because precise start and end times are unavailable, ViGA adopts a contrastive learning strategy similar to those used under weak supervision. The video is split into clips with a sliding window, and a Gaussian prior centered at the glanced timestamp assigns higher weights to clips near the glance. The model learns joint vision‑language embeddings with an InfoNCE loss, plus an additional attention loss that encourages the encoder to focus on the glanced region.
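The training ingredients described above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the window, stride, `sigma` and `temperature` values, and the way the Gaussian weights scale each clip's InfoNCE term are all choices made here for illustration. The attention loss is omitted.

```python
import math


def sliding_window_clips(num_frames, window, stride):
    """Return (start, end) frame spans, end exclusive, that cover the video."""
    return [(s, min(s + window, num_frames)) for s in range(0, num_frames, stride)]


def gaussian_clip_weights(clips, glance_frame, sigma):
    """Weight each clip by a Gaussian of the distance from its center to the glance."""
    weights = []
    for start, end in clips:
        center = (start + end - 1) / 2.0
        weights.append(math.exp(-((center - glance_frame) ** 2) / (2.0 * sigma ** 2)))
    return weights


def info_nce(sim_pos, sims_neg, temperature=0.07):
    """InfoNCE loss for one positive similarity against a list of negatives."""
    logits = [sim_pos / temperature] + [s / temperature for s in sims_neg]
    top = max(logits)  # subtract the max for numerical stability
    log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_denominator - logits[0]


def glance_weighted_loss(pos_sims, neg_sims, weights, temperature=0.07):
    """Assumed combination: weighted average of per-clip InfoNCE terms."""
    total = sum(w * info_nce(s, neg_sims, temperature)
                for s, w in zip(pos_sims, weights))
    return total / sum(weights)


clips = sliding_window_clips(num_frames=10, window=4, stride=2)
weights = gaussian_clip_weights(clips, glance_frame=5, sigma=2.0)
```

In this sketch `pos_sims` holds the similarity between the query and each clip of its own video, while `neg_sims` holds similarities to clips from other videos. Clips far from the glance still contribute, but with a smaller weight, which reflects the idea that the true moment extends some unknown distance around the glanced frame.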

Inference

During inference, the trained model’s attention map identifies high‑attention regions, which serve as anchors for generating proposals of varying lengths. The proposal with the highest similarity to the query embedding is selected as the final output.
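The inference procedure can be sketched as follows. The sketch makes several assumptions that the text does not specify: anchors are taken as the top‑`k` attention scores, proposals are centered on each anchor, a proposal is represented by mean‑pooling its clip embeddings, and similarity is cosine similarity. All function names are hypothetical.

```python
import math


def top_anchors(attention, k):
    """Indices of the k clips with the highest attention scores."""
    order = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    return order[:k]


def proposals_around(anchor, lengths, num_clips):
    """Clip spans (start, end), end exclusive, of several lengths near an anchor."""
    spans = []
    for length in lengths:
        start = max(0, anchor - length // 2)
        spans.append((start, min(num_clips, start + length)))
    return spans


def mean_pool(vectors):
    """Average a list of equal-length vectors component by component."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm > 0.0 else 0.0


def retrieve(clip_embs, attention, query_emb, k=2, lengths=(1, 3, 5)):
    """Return the best (span, score) among proposals grown from attention anchors."""
    best_span, best_score = None, float("-inf")
    for anchor in top_anchors(attention, k):
        for start, end in proposals_around(anchor, lengths, len(clip_embs)):
            score = cosine(mean_pool(clip_embs[start:end]), query_emb)
            if score > best_score:
                best_span, best_score = (start, end), score
    return best_span, best_score
```

Generating several lengths per anchor matters because the attention map says roughly where the moment is, not how long it lasts; the query similarity is what decides between a short and a long proposal.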

4. Experimental Results

Experiments on three standard VMR datasets (Charades‑STA, ActivityNet Captions, TACoS) show that ViGA outperforms state‑of‑the‑art weakly supervised methods and is comparable to several fully supervised approaches. The authors note that ViGA is an initial exploration and encourage further improvements on the glance‑supervised setting.

5. Visual Examples

Illustrative cases include two successful retrievals and one failure, demonstrating the strengths and limitations of the proposed method.

References

[1] Lisa Anne Hendricks et al., "Localizing moments in video with natural language," CVPR 2017.

[2] Jiyang Gao et al., "TALL: Temporal activity localization via language query," CVPR 2017.

[3] Niluthpol Chowdhury Mithun et al., "Weakly supervised video moment retrieval from text queries," CVPR 2019.
