Video Object of Interest Segmentation (VOIS): Task, Dataset, and Dual-Path Transformer Approach
The paper presents Video Object of Interest Segmentation (VOIS), a new e‑commerce task that locates and segments video instances matching a given product image, introduces the LiveVideos dataset of 2,418 Taobao live‑stream clips, and proposes a dual‑path Swin‑Transformer with cross‑fusion modules that outperforms existing VOS/VIS baselines.
Video Object of Interest Segmentation (VOIS) addresses a limitation of traditional video object segmentation (VOS) and video instance segmentation (VIS) in e‑commerce scenarios: given an image of a target object, the goal is to detect, track, and segment the corresponding instances in a video.
The task definition requires the model to output all instances that match the provided object image, handling variations in shape, angle, and appearance.
A new dataset, LiveVideos, is constructed from Taobao live‑stream videos and white‑background product images. It contains 2,418 video clips, 2,418 product images, 3,341 target objects, and 114k masks. The dataset can also support video retrieval and highlight detection.
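To give a feel for the annotation density, the counts above can be turned into rough per-clip and per-object averages. This is only back-of-the-envelope arithmetic: "114k" is a rounded figure, and the variable names are illustrative rather than taken from the dataset release.

```python
# Counts quoted in the text; "114k" is rounded, so results are approximate.
clips = 2418
product_images = 2418
target_objects = 3341
masks = 114_000

# Average number of annotated target objects per video clip.
objects_per_clip = target_objects / clips

# Average number of masks (annotated frames) per target object.
masks_per_object = masks / target_objects

# Average number of masks per video clip.
masks_per_clip = masks / clips

print(f"objects per clip : {objects_per_clip:.2f}")
print(f"masks per object : {masks_per_object:.1f}")
print(f"masks per clip   : {masks_per_clip:.1f}")
```

The equal clip and image counts suggest one product image per clip, although that pairing is an inference from the numbers rather than something the text states.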
The proposed solution follows an encoder‑decoder architecture. A dual‑path Swin‑Transformer extracts video and image features, which are fused by Cross‑Transformer modules inserted at stages 3 and 4. The fused features are fed to a Transformer decoder (inspired by DETR), whose object queries produce the instance predictions. Training uses bipartite matching between predictions and ground truth, with a Hungarian loss.
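The bipartite matching step can be illustrated with a minimal sketch. It assumes a precomputed cost matrix between ground-truth instances and query predictions (in DETR-style training this cost typically combines classification and mask terms, which are not reproduced here). For clarity it searches all assignments by brute force; this is not the Hungarian algorithm itself, which finds the same optimum efficiently, and it is not the authors' code. The function name match_queries and the example costs are hypothetical.

```python
from itertools import permutations


def match_queries(cost):
    """Find the minimum-cost one-to-one assignment of ground truths to queries.

    cost[i][j] is the (assumed, precomputed) cost of assigning ground-truth
    instance i to query prediction j. Returns (assignment, total), where
    assignment[i] is the query index matched to ground truth i.

    Brute force over all assignments: fine for a tiny illustration only.
    """
    num_gt = len(cost)
    num_queries = len(cost[0]) if cost else 0
    if num_gt > num_queries:
        raise ValueError("need at least as many queries as ground truths")

    best_total, best_assign = float("inf"), ()
    # Each permutation picks a distinct query for every ground truth.
    for perm in permutations(range(num_queries), num_gt):
        total = sum(cost[i][q] for i, q in enumerate(perm))
        if total < best_total:
            best_total, best_assign = total, perm
    return list(best_assign), best_total


# Hypothetical costs: 2 ground-truth instances, 4 object queries.
cost = [
    [0.9, 0.2, 0.7, 0.8],
    [0.3, 0.6, 0.1, 0.9],
]
assignment, total = match_queries(cost)

# Queries left unmatched are trained to predict "no object".
unmatched = sorted(set(range(len(cost[0]))) - set(assignment))
print(assignment, unmatched)
```

Once the assignment is fixed, the loss is computed only between matched pairs, while unmatched queries are pushed toward the "no object" class. This is what lets a set-prediction decoder avoid duplicate detections without non-maximum suppression.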
Baseline comparisons with adapted MaskTrack R‑CNN and VisTR show that the dual‑path Swin‑Transformer achieves higher Average Precision (AP) and Average Recall (AR). Ablation studies confirm the importance of the image branch and the two Cross‑Transformer fusion stages.
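AP and AR for video segmentation are commonly computed from a spatio-temporal mask IoU, in which intersections and unions are accumulated over all frames of an instance before dividing. The sketch below assumes that convention (as used in VIS benchmarks); the text does not spell out the exact evaluation protocol, and the function name and toy masks are illustrative.

```python
def video_mask_iou(pred, gt):
    """Spatio-temporal IoU between a predicted and a ground-truth mask track.

    pred and gt are lists with one entry per frame; each entry is a set of
    (row, col) foreground pixels. An empty set means the instance is absent
    (or was missed) in that frame.
    """
    if len(pred) != len(gt):
        raise ValueError("tracks must cover the same number of frames")
    # Accumulate over frames first, then divide once.
    inter = sum(len(p & g) for p, g in zip(pred, gt))
    union = sum(len(p | g) for p, g in zip(pred, gt))
    return inter / union if union else 0.0


# Toy three-frame track: partial overlap, a missed frame, a perfect frame.
pred_track = [{(0, 0), (0, 1)}, set(), {(3, 3), (3, 4)}]
gt_track = [{(0, 1), (1, 1)}, {(2, 2)}, {(3, 3), (3, 4)}]

iou = video_mask_iou(pred_track, gt_track)
print(iou)
```

Because the sums run over the whole track, a frame where the object is missed lowers the score even when the other frames are segmented perfectly, so tracking failures are penalised alongside per-frame mask quality.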
The work demonstrates a versatile video segmentation paradigm that can handle arbitrary video‑image pairs. Future work includes supporting product images that contain multiple objects and broadening category coverage.
Alimama Tech
Official Alimama tech channel, showcasing all of Alimama's technical innovations.