VisTR: End-to-End Video Instance Segmentation with Transformers

VisTR recasts video instance segmentation as an end-to-end sequence-to-sequence task. A CNN backbone, a Transformer encoder-decoder with instance queries, and Hungarian matching jointly predict masks, classes, and tracks across frames, reaching the highest single-model accuracy (40.1 AP) at 57.7 FPS (pure inference) on YouTube-VIS.

Meituan Technology Team

Background

Instance segmentation on static images is a core computer‑vision task. Video Instance Segmentation (VIS) extends this problem to video streams, requiring detection, segmentation, and tracking of objects across frames. The richer temporal information in videos makes modeling more challenging but also more valuable for real‑world applications such as autonomous driving and online media.

Related Work

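Earlier VIS methods fall into two groups. Top-down methods such as MaskTrack R-CNN and MaskProp segment each frame and then link instances with a separate tracking or propagation step, while bottom-up methods such as STEm-Seg cluster pixel embeddings into instances. Both kinds of pipeline involve several stages, are relatively slow, and do not fully exploit temporal continuity.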

VisTR Algorithm Introduction

Problem Redefinition

VisTR reformulates VIS as a sequence‑to‑sequence (Seq2Seq) prediction task: given a clip of multiple frames, the model directly outputs a sequence of mask predictions. This unifies instance segmentation and tracking within a single similarity‑learning framework.

Algorithm Flow

1. A CNN backbone extracts per-frame features.

2. Features from all frames are flattened into a spatio-temporal sequence and fed to a Transformer encoder.

3. A set of learnable Instance Queries is processed by the Transformer decoder to produce instance-level embeddings.

4. Instance Sequence Matching aligns predicted instance sequences with ground-truth sequences using Hungarian matching.

5. Instance Sequence Segmentation converts instance embeddings into mask sequences via self-attention and 3D convolutions.
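As a rough illustration of step 2, the sketch below shows one way per-frame feature maps could be flattened into a single spatio-temporal token sequence. The frame-major ordering, the helper names, and the example clip size (36 frames with a 10 x 15 feature grid) are assumptions made for illustration and are not taken from the VisTR code.

```python
from typing import Tuple


def sequence_length(n_frames: int, height: int, width: int) -> int:
    """Number of tokens after flattening a clip of feature maps."""
    return n_frames * height * width


def flat_index(t: int, h: int, w: int, height: int, width: int) -> int:
    """Position of feature cell (t, h, w) in a frame-major flattened sequence."""
    return (t * height + h) * width + w


def unflatten_index(i: int, height: int, width: int) -> Tuple[int, int, int]:
    """Inverse of flat_index: recover (t, h, w) from a sequence position."""
    t, rem = divmod(i, height * width)
    h, w = divmod(rem, width)
    return t, h, w


# Hypothetical clip: 36 frames, each with a 10 x 15 feature grid.
T, H, W = 36, 10, 15
n_tokens = sequence_length(T, H, W)          # 36 * 10 * 15 = 5400 tokens
first_of_frame_1 = flat_index(1, 0, 0, H, W)  # frame 1 starts at token 150
```

Because every token can attend to every other token, the encoder sees all frames at once; the price is that the sequence length grows linearly with the number of frames.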

Network Structure

The architecture consists of:

Backbone: CNN for initial feature extraction and positional encoding.

Encoder: Transformer encoder that models long-range spatio-temporal dependencies.

Decoder: Transformer decoder with Instance Queries to decode instance predictions.

Instance Sequence Matching: Supervises the order of instance predictions across frames.

Instance Sequence Segmentation: Generates final mask sequences using attention-based masks and 3D convolutions.
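Instance Sequence Matching can be illustrated on a toy clip. The sketch below scores every one-to-one assignment between ground-truth and predicted instance sequences and keeps the cheapest one. It uses an exhaustive search over permutations as a stand-in for the Hungarian algorithm (both return a minimum-cost assignment, but the exhaustive search is only practical for a handful of instances). The cost used here (negative class probability plus mean L1 box distance over frames), the data layout, and all names are assumptions for illustration.

```python
from itertools import permutations
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (cx, cy, w, h), normalized to [0, 1]


def sequence_cost(prob_of_gt_class: float,
                  pred_boxes: Sequence[Box],
                  gt_boxes: Sequence[Box]) -> float:
    """Assumed clip-level cost for one (ground truth, prediction) pair."""
    l1 = sum(abs(p - g)
             for pb, gb in zip(pred_boxes, gt_boxes)
             for p, g in zip(pb, gb))
    return -prob_of_gt_class + l1 / len(gt_boxes)


def match_sequences(cost: List[List[float]]) -> List[int]:
    """cost[i][j] is the cost of giving ground truth i to prediction j.

    Returns, for each ground truth, the index of its matched prediction.
    Exhaustive search; a Hungarian solver would find the same minimum faster.
    """
    n_gt, n_pred = len(cost), len(cost[0])
    if n_gt > n_pred:
        raise ValueError("need at least as many predictions as ground truths")
    best, best_total = [], float("inf")
    for perm in permutations(range(n_pred), n_gt):
        total = sum(cost[i][j] for i, j in enumerate(perm))
        if total < best_total:
            best, best_total = list(perm), total
    return best


# Toy clip with 2 frames, 2 ground-truth instances and 3 predicted sequences.
gts = [
    {"label": 0, "boxes": [(0.2, 0.2, 0.1, 0.1), (0.25, 0.2, 0.1, 0.1)]},
    {"label": 1, "boxes": [(0.7, 0.7, 0.2, 0.2), (0.7, 0.65, 0.2, 0.2)]},
]
preds = [
    {"probs": [0.1, 0.8], "boxes": [(0.7, 0.7, 0.2, 0.2), (0.7, 0.66, 0.2, 0.2)]},
    {"probs": [0.05, 0.05], "boxes": [(0.5, 0.5, 0.3, 0.3), (0.5, 0.5, 0.3, 0.3)]},
    {"probs": [0.9, 0.05], "boxes": [(0.2, 0.21, 0.1, 0.1), (0.24, 0.2, 0.1, 0.1)]},
]
cost = [[sequence_cost(p["probs"][g["label"]], p["boxes"], g["boxes"])
         for p in preds] for g in gts]
assignment = match_sequences(cost)  # ground truth 0 -> pred 2, ground truth 1 -> pred 0
```

Matching whole sequences rather than single frames is what ties an instance to the same output slot in every frame, so tracking falls out of the prediction order. In DETR-style training, predictions left unmatched (prediction 1 here) are supervised as "no object".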

Loss Functions

The overall loss combines:

Matching loss (classification + bounding‑box regression) for Instance Sequence Matching.

Segmentation loss (Dice + Focal) for mask prediction.

A weighted sum of classification, box, and mask losses for end‑to‑end training.
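To make the segmentation terms concrete, here is a minimal sketch of Dice and Focal losses on a flattened mask, followed by a weighted sum of the three loss components. The smoothing constant, the focal parameters (alpha = 0.25, gamma = 2), and the loss weights are common defaults chosen for illustration; they are assumptions, not values confirmed by the article.

```python
import math
from typing import Sequence


def dice_loss(probs: Sequence[float], targets: Sequence[int],
              eps: float = 1.0) -> float:
    """1 - Dice coefficient between predicted probabilities and a binary mask."""
    inter = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


def focal_loss(probs: Sequence[float], targets: Sequence[int],
               alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Mean binary focal loss; down-weights pixels that are already easy."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, 1e-7), 1.0 - 1e-7)       # avoid log(0)
        p_t = p if t == 1 else 1.0 - p          # probability of the true class
        a_t = alpha if t == 1 else 1.0 - alpha  # class-balancing weight
        total += -a_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


def total_loss(cls_loss: float, box_loss: float, mask_loss: float,
               w_cls: float = 1.0, w_box: float = 1.0,
               w_mask: float = 1.0) -> float:
    """Weighted sum of the three components (weights are hypothetical)."""
    return w_cls * cls_loss + w_box * box_loss + w_mask * mask_loss


# Toy 4-pixel mask: a good prediction should cost less than a poor one.
target = [1, 0, 1, 0]
good = [0.9, 0.1, 0.8, 0.2]
poor = [0.4, 0.6, 0.3, 0.7]
mask_loss_good = dice_loss(good, target) + focal_loss(good, target)
mask_loss_poor = dice_loss(poor, target) + focal_loss(poor, target)
```

Dice measures region overlap and is insensitive to how many background pixels there are, while Focal loss works per pixel and concentrates on hard ones, which is one reason the two are often combined for mask prediction.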

Experiments

VisTR is evaluated on the YouTube‑VIS benchmark (2238 training, 302 validation, 343 test videos, 40 categories). Metrics include AP and AR.
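AP on YouTube-VIS is based on an IoU between whole mask sequences rather than single-frame masks: intersection and union are accumulated over all frames of a track, so a prediction that segments well but switches identity is penalized. A minimal sketch of that sequence-level IoU follows. It assumes each mask is a set of (row, col) pixels and that None marks a frame where the instance is absent, a representation chosen here purely for illustration.

```python
from typing import List, Optional, Set, Tuple

Mask = Optional[Set[Tuple[int, int]]]  # None means "not present in this frame"


def sequence_iou(pred: List[Mask], gt: List[Mask]) -> float:
    """IoU between two mask sequences, accumulated over all frames."""
    if len(pred) != len(gt):
        raise ValueError("sequences must cover the same frames")
    inter = 0
    union = 0
    for p, g in zip(pred, gt):
        p = p or set()
        g = g or set()
        inter += len(p & g)
        union += len(p | g)
    return inter / union if union else 0.0


# Toy 3-frame track: perfect in frame 0, half right in frame 1,
# and a false positive in frame 2 where the object has left the scene.
gt_track = [{(0, 0), (0, 1)}, {(0, 1), (0, 2)}, None]
pred_track = [{(0, 0), (0, 1)}, {(0, 2), (0, 3)}, {(1, 1)}]
iou = sequence_iou(pred_track, gt_track)  # (2 + 1 + 0) / (2 + 3 + 1) = 0.5
```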

Temporal Information Importance

Increasing the number of frames in a clip (e.g., from 18 to 36) consistently improves AP, demonstrating the benefit of richer temporal cues.

Query Study

Experiments compare different query sharing strategies: Prediction‑Level (one query per object per frame), Instance‑Level (one query per object across frames), and Frame‑Level (one query per frame). Instance‑Level queries achieve near‑optimal performance while reducing the number of queries.
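The three strategies differ in how many distinct query embeddings are learned and which predictions share one. The sketch below counts the embeddings and maps each (frame, instance) prediction to the embedding that would serve it. The function names, the frame-major layout, and the example sizes (36 frames, 10 instance slots per frame) are assumptions for illustration.

```python
def num_embeddings(level: str, n_frames: int, n_instances: int) -> int:
    """Number of distinct learned query embeddings under a sharing strategy."""
    if level == "prediction":   # one per object per frame
        return n_frames * n_instances
    if level == "instance":     # one per object, shared across frames
        return n_instances
    if level == "frame":        # one per frame, shared across objects
        return n_frames
    raise ValueError(f"unknown level: {level}")


def embedding_index(level: str, frame: int, instance: int,
                    n_instances: int) -> int:
    """Which learned embedding serves the query for (frame, instance)."""
    if level == "prediction":
        return frame * n_instances + instance
    if level == "instance":
        return instance
    if level == "frame":
        return frame
    raise ValueError(f"unknown level: {level}")


# Hypothetical setting: 36 frames, 10 instance slots per frame.
counts = {level: num_embeddings(level, 36, 10)
          for level in ("prediction", "instance", "frame")}
# counts == {"prediction": 360, "instance": 10, "frame": 36}
```

With instance-level sharing, the query for slot 3 uses the same embedding in frame 0 and in frame 35, which matches the intuition that one query should follow one object through the clip.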

Other Design Choices

Adding positional encoding, using Transformer-encoded features instead of CNN-encoded ones, and incorporating 3D convolutions each improve AP by roughly 1 to 5 points.

Visualization and Comparison

Qualitative results show robust segmentation and tracking under occlusion, motion, and appearance changes. Compared with prior methods, VisTR attains the highest single‑model AP (40.1) while running at 57.7 FPS (pure inference).

Conclusion

VisTR introduces the first Transformer‑based end‑to‑end framework for video instance segmentation, unifying detection, segmentation, and tracking. It achieves state‑of‑the‑art accuracy and speed, and the code and paper are publicly available.

For more details, see the original paper, "End-to-End Video Instance Segmentation with Transformers", and the GitHub repository: https://github.com/Epiphqny/VisTR
