How X2SAM Empowers Multimodal Models to Segment Images and Videos at Pixel Level
X2SAM is a unified multimodal large model that combines image and video segmentation with language and visual prompts, introduces a Mask Memory for temporal consistency, defines a new V‑VGD task, and achieves state‑of‑the‑art results while cutting training cost by over 30%.
Recent advances in multimodal large models enable them to interpret images and videos and answer complex questions, but precise pixel‑level segmentation of arbitrary targets remains challenging. Users may ask a model to isolate a specific object across all frames of a video, requiring both natural‑language understanding and consistent frame‑wise localization.
To address this, researchers from Sun Yat‑sen University and Meituan propose X2SAM, a unified framework that integrates a multimodal large model, a region‑sampling module, a Mask Encoder, a Mask Decoder, and a Mask Memory. After a visual encoder extracts image features, the multimodal model processes textual instructions, visual cues, and context to generate a target representation. The Mask Encoder refines visual features, and the Mask Decoder produces pixel‑level masks. For video input, Mask Memory stores target information from previous frames and supplies temporal references, enabling stable segmentation despite motion, occlusion, or deformation.
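To make the data flow concrete, below is a minimal, hypothetical sketch of such a pipeline. The module names, shapes, and placeholder bodies are illustrative assumptions, not the authors' implementation; the point is only the order of operations and the role of the Mask Memory when processing video.

```python
# Hypothetical sketch of an X2SAM-style forward pass (placeholders, not the real model).
from dataclasses import dataclass, field
from collections import deque
import numpy as np


@dataclass
class MaskMemory:
    """Fixed-length store of target features from previous frames."""
    max_len: int = 8
    entries: deque = field(default_factory=deque)

    def update(self, target_feat):
        """Append the latest target features, dropping the oldest when full."""
        self.entries.append(target_feat)
        while len(self.entries) > self.max_len:
            self.entries.popleft()

    def temporal_reference(self):
        """Average stored features into a single temporal reference (or None)."""
        if not self.entries:
            return None
        return np.mean(np.stack(list(self.entries)), axis=0)


def visual_encoder(frame):
    # Placeholder for the vision backbone: collapse RGB into one "feature" channel.
    return frame.mean(axis=-1, keepdims=True)            # (H, W, 1)


def multimodal_lm(text_prompt, visual_prompt, feats, reference):
    # Placeholder for the multimodal LLM: fuse the instruction, visual cue
    # (e.g. a sampled region), image features, and temporal reference
    # into a target embedding. (Region sampling is omitted here.)
    return np.ones(256, dtype=np.float32)


def mask_encoder(feats):
    # Placeholder for the Mask Encoder: refine visual features before decoding.
    return feats


def mask_decoder(feats, target_emb):
    # Placeholder decoder: threshold refined features into a binary pixel mask.
    return (feats[..., 0] > feats.mean()).astype(np.uint8)


def segment_video(frames, text_prompt="the athlete sliding down", visual_prompt=None):
    """Frame-by-frame segmentation with a rolling mask memory."""
    memory, masks = MaskMemory(), []
    for frame in frames:
        feats = visual_encoder(frame)
        target = multimodal_lm(text_prompt, visual_prompt, feats,
                               memory.temporal_reference())
        masks.append(mask_decoder(mask_encoder(feats), target))
        memory.update(target)          # carry target info into later frames
    return masks


if __name__ == "__main__":
    video = [np.random.rand(64, 64, 3) for _ in range(4)]
    print([m.shape for m in segment_video(video)])        # -> four (64, 64) masks
```

The real system naturally replaces these placeholders with a ViT backbone, an LLM, and learned mask encoder/decoder heads; the sketch only shows the data flow and how the memory feeds later frames.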
Users can provide textual prompts such as "the athlete sliding down" or visual cues like points, boxes, or region selections, and X2SAM outputs the corresponding segmentation mask. The model supports a wide range of tasks within the same framework: generic segmentation, open‑vocabulary segmentation, referring expression segmentation, reasoning segmentation, dialogue‑driven segmentation, visual grounding segmentation, and object‑level segmentation for both images and videos.
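As a rough illustration of how such prompts might be expressed, the snippet below shows hypothetical prompt payloads; the field names and the `model.segment` call are assumptions for exposition, not a released API.

```python
# Hypothetical prompt formats for a unified model like X2SAM (illustrative only).
text_prompt  = {"type": "text",  "value": "the athlete sliding down"}
point_prompt = {"type": "point", "value": (312, 188), "frame": 0}
box_prompt   = {"type": "box",   "value": (120, 60, 340, 420), "frame": 0}

# A single entry point could dispatch any of these to the same
# encoder-decoder pipeline sketched above, e.g.:
# masks = model.segment(media, prompt=point_prompt)
```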
The work also introduces a novel video visual grounding task, V‑VGD (Video Visual Grounded Segmentation). In V‑VGD, a single point or box placed on the first frame in which a target is visible must guide segmentation of that target throughout the entire video. The authors built a dataset based on YT‑VIS19 and VIPSeg, providing visual prompts in the initial frame and requiring continuous segmentation in subsequent frames, a capability valuable for video editing, automatic annotation, and intelligent retrieval.
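For a concrete picture of the task format, the following is an illustrative sketch of how a V‑VGD sample might be laid out: one point or box on the first frame where the target appears, paired with ground‑truth masks for the rest of the clip. The field names and the RLE placeholders are assumptions, not the authors' released annotation schema.

```python
# Illustrative V-VGD sample layout (assumed field names, not the official schema).
vvgd_sample = {
    "video_id": "ytvis19_0001",          # clip drawn from YT-VIS19 or VIPSeg
    "prompt": {
        "type": "box",                   # or "point"
        "frame": 0,                      # first frame where the target is visible
        "xyxy": [120, 60, 340, 420],
    },
    "target_masks": {                    # frame index -> ground-truth mask
        0: "<rle-mask>",
        1: "<rle-mask>",
        2: "<rle-mask>",
    },
}
```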
Experimental results show that X2SAM maintains strong performance on image tasks while achieving significant gains on video tasks. On the ADE20K open‑vocabulary image segmentation benchmark, X2SAM surpasses previous state‑of‑the‑art methods. For video tasks, it reaches 60.3 AP on video open‑vocabulary segmentation, 69.9 J&F on video reasoning segmentation (a 14.2‑point improvement), 75.8 mIoU on video dialogue‑driven segmentation, and consistently outperforms baselines on the newly proposed V‑VGD task.
Training efficiency also improves: the unified training strategy cuts GPU hours from roughly 5.2K to 3.3K, a 36.5% reduction, showing that joint image‑video segmentation does not demand a proportional increase in compute.
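The reported reduction follows directly from the quoted GPU‑hour figures:

```latex
\frac{5.2\,\text{K} - 3.3\,\text{K}}{5.2\,\text{K}} = \frac{1.9}{5.2} \approx 0.365 \;\Rightarrow\; \text{a } 36.5\% \text{ reduction}
```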
Despite these advances, challenges remain. Joint training still demands high memory and compute, especially for video data. The current Mask Memory has a fixed length, limiting performance on very long videos or prolonged occlusions. Moreover, as a generalist model, X2SAM may lag behind highly specialized expert models on niche tasks. Future work will explore more efficient training, lighter architectures, and longer‑range memory mechanisms to enhance stability and scalability in complex video scenarios.
In summary, X2SAM unifies image segmentation, video segmentation, language understanding, visual prompting, and temporal memory within a single multimodal framework, enabling pixel‑level perception for a broad set of applications such as video editing, automatic labeling, embodied AI, robotic perception, and multimodal interaction.