CenterMask: Single-Shot Instance Segmentation with Point Representation

CenterMask is a single‑shot, anchor‑free instance segmentation framework that predicts a coarse shape from each object’s center point and a full‑image saliency map, multiplies them to produce precise masks, and achieves competitive COCO AP while running faster than two‑stage methods like Mask R-CNN.

Meituan Technology Team

May 21, 2020

Computer vision is a key technology for autonomous driving, and instance segmentation is one of its fundamental problems. CenterMask, accepted at CVPR 2020, is a one‑stage instance segmentation method that does not rely on predefined region proposals.

Background

Instance segmentation requires locating, classifying, and segmenting each object instance in an image, which is more challenging than object detection or semantic segmentation. Traditional two‑stage methods (e.g., Mask R-CNN) achieve high accuracy but are computationally heavy. Recent anchor‑free one‑stage detectors such as CenterNet and FCOS have shown that removing anchors can improve speed and flexibility, prompting the question of whether a similar one‑stage approach can be applied to instance segmentation.

Challenges of One‑Stage Instance Segmentation

How to differentiate multiple instances of the same class without region proposals.

How to preserve pixel‑level positional information, especially around object boundaries.

Related Work

Existing methods can be divided into two‑stage and one‑stage approaches. Two‑stage methods (e.g., Mask R-CNN) follow a detect‑then‑segment pipeline. One‑stage methods are further split into global‑image‑based (e.g., InstanceFCN, YOLACT) and local‑region‑based (e.g., PolarMask, TensorMask) techniques, each with its own limitations in occlusion handling and speed.

CenterMask Overview

CenterMask introduces two parallel branches:

Local Shape branch: predicts a coarse, instance‑aware shape from the object’s center point.

Global Saliency branch: predicts a full‑image saliency map that retains fine‑grained spatial details.

The final mask for each instance is obtained by element‑wise multiplication of the local shape and global saliency outputs, combining instance awareness with precise localization.
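The multiplication step can be illustrated with a minimal NumPy sketch. It assumes both branches emit logits over the same h×w window, that a sigmoid is applied to each before multiplying, and that the product is thresholded at 0.5; the function name `assemble_mask` and the threshold are illustrative choices, not details given in this article.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def assemble_mask(local_shape_logits, saliency_logits, threshold=0.5):
    """Combine a coarse local shape with the matching saliency window.

    Both inputs are (h, w) arrays of logits covering the same image window.
    Applying a sigmoid to each and thresholding at 0.5 are assumptions.
    """
    if local_shape_logits.shape != saliency_logits.shape:
        raise ValueError("both inputs must cover the same h x w window")
    prob = sigmoid(local_shape_logits) * sigmoid(saliency_logits)
    return prob > threshold

# Toy 2x2 window: a pixel survives only if BOTH branches mark it as foreground.
local_shape = np.array([[4.0, 4.0], [4.0, -4.0]])
saliency = np.array([[4.0, -4.0], [4.0, 4.0]])
mask = assemble_mask(local_shape, saliency)
```

In the toy input, the top-right pixel is rejected by the saliency branch and the bottom-right pixel by the local shape branch, so only the left column remains in the mask.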

Network Architecture

The backbone extracts features, which are then fed into five parallel heads: Heatmap, Offset, Shape, Size, and Saliency. Heatmap and Offset locate object centers; Shape and Size predict the coarse shape and size at each center; Saliency produces a pixel‑wise foreground‑background map.
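As a rough illustration of how the Heatmap and Offset heads could be turned into center points, the sketch below picks local maxima of a single-class heatmap and refines them with the predicted offsets. The 3×3 peak test, the score threshold, the top-k limit, and the (dx, dy) channel order are assumptions borrowed from CenterNet-style decoding, not details stated in this article; `decode_centers` is a hypothetical helper name.

```python
import numpy as np

def decode_centers(heatmap, offset, score_thresh=0.3, top_k=100):
    """Return sub-pixel centers (x, y) and their scores.

    heatmap: (H, W) float scores for one class.
    offset:  (H, W, 2) sub-pixel corrections, assumed ordered as (dx, dy).
    """
    H, W = heatmap.shape
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    # Maximum over each 3x3 neighbourhood, built from nine shifted views.
    neigh_max = np.max(
        [padded[dy:dy + H, dx:dx + W] for dy in range(3) for dx in range(3)],
        axis=0,
    )
    # A peak equals its neighbourhood maximum and clears the threshold.
    is_peak = (heatmap == neigh_max) & (heatmap >= score_thresh)
    ys, xs = np.nonzero(is_peak)
    order = np.argsort(-heatmap[ys, xs])[:top_k]  # highest score first
    ys, xs = ys[order], xs[order]
    centers = np.stack([xs + offset[ys, xs, 0], ys + offset[ys, xs, 1]], axis=1)
    return centers, heatmap[ys, xs]

# Toy example: two true peaks and one weaker neighbour that is suppressed.
heatmap = np.zeros((5, 5))
heatmap[1, 1], heatmap[1, 2], heatmap[3, 4] = 0.9, 0.5, 0.6
offset = np.zeros((5, 5, 2))
offset[1, 1] = [0.25, 0.5]
centers, scores = decode_centers(heatmap, offset)
```

The response at (x=2, y=1) is dropped because its neighbour at (x=1, y=1) scores higher, which is how this kind of peak picking avoids a separate non-maximum suppression step.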

Local Shape Prediction

Each object’s mask is decomposed into a shape (a fixed‑size S×S feature) and a size (height and width). The Shape head outputs a tensor of size H×W×S×S, while the Size head outputs H×W×2. For a center (x, y), the shape feature at that location is reshaped into an S×S matrix and resized to the predicted height and width, forming the Local Shape representation.
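The reshape-and-resize step described above can be sketched in NumPy as follows. The sketch assumes the Shape head output is stored as (H, W, S·S), that the Size head stores (height, width) in that order, and that nearest-neighbour interpolation is acceptable; the article does not specify the interpolation method, and `local_shape_at` and `resize_nearest` are hypothetical helper names.

```python
import numpy as np

def resize_nearest(patch, out_h, out_w):
    """Nearest-neighbour resize of a 2-D array to (out_h, out_w)."""
    in_h, in_w = patch.shape
    rows = np.minimum((np.arange(out_h) * in_h) // out_h, in_h - 1)
    cols = np.minimum((np.arange(out_w) * in_w) // out_w, in_w - 1)
    return patch[rows[:, None], cols[None, :]]

def local_shape_at(shape_map, size_map, x, y):
    """Build the coarse shape for the object centered at integer (x, y).

    shape_map: (H, W, S*S) Shape head output.
    size_map:  (H, W, 2) Size head output, assumed ordered (height, width).
    """
    s = int(round(np.sqrt(shape_map.shape[-1])))
    patch = shape_map[y, x].reshape(s, s)  # fixed-size S x S shape feature
    # Round the predicted size to whole pixels, at least 1 x 1.
    h, w = np.maximum(np.rint(size_map[y, x]).astype(int), 1)
    return resize_nearest(patch, h, w)

# Toy example with S = 2: a 2x2 shape stretched to a 4x6 window.
shape_map = np.zeros((4, 4, 4))
size_map = np.zeros((4, 4, 2))
shape_map[2, 1] = [1.0, 2.0, 3.0, 4.0]
size_map[2, 1] = [4.0, 6.0]
local = local_shape_at(shape_map, size_map, x=1, y=2)
```

Because the S×S feature is fixed in size while the output window follows the predicted height and width, the same head can describe both small and large objects.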

Global Saliency Generation

The Saliency branch predicts a full‑resolution map indicating whether each pixel belongs to an object (foreground) or background, providing fine spatial alignment without complex feature‑alignment modules.
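To line a full-image saliency map up with one instance, a window around that instance's center has to be cut out of the map. The sketch below shows one way this could be done. Cropping by center and predicted size, and filling with zeros where the window leaves the image, are assumptions made for illustration; `crop_saliency` is a hypothetical helper name.

```python
import numpy as np

def crop_saliency(saliency, cx, cy, h, w):
    """Cut an h x w window centered on (cx, cy) out of a (H, W) saliency map.

    Parts of the window that fall outside the image are filled with zeros
    (an assumption; other padding policies are possible).
    """
    H, W = saliency.shape
    top, left = int(round(cy - h / 2)), int(round(cx - w / 2))
    crop = np.zeros((h, w), dtype=saliency.dtype)
    # Clip the window to the image, then copy the overlapping region.
    y0, y1 = max(top, 0), min(top + h, H)
    x0, x1 = max(left, 0), min(left + w, W)
    if y0 < y1 and x0 < x1:
        crop[y0 - top:y1 - top, x0 - left:x1 - left] = saliency[y0:y1, x0:x1]
    return crop

# Toy 4x4 map with values 1..16 so that zero padding is easy to spot.
saliency = np.arange(1, 17, dtype=float).reshape(4, 4)
inner = crop_saliency(saliency, cx=2, cy=2, h=2, w=2)
corner = crop_saliency(saliency, cx=0, cy=0, h=2, w=2)
```

Since the crop is plain array slicing on a shared map, no per-instance feature-alignment module is involved, which matches the motivation given for the Saliency branch.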

Experimental Results

Visualization experiments show that the Local Shape branch alone yields coarse but well‑separated masks, while the Global Saliency branch alone produces fine masks when there is no occlusion. Combining both branches (CenterMask) achieves accurate segmentation even under heavy occlusion.

On the COCO test‑dev set, CenterMask attains a competitive trade‑off between AP and FPS, outperforming other one‑stage methods while being faster than two‑stage approaches such as Mask R-CNN.

When integrated into the FCOS detector (CenterMask‑FCOS), the method reaches an AP of 38.5, demonstrating its adaptability to existing one‑stage detectors.

Future Work

The authors plan to eliminate the dependence on bounding‑box cues entirely and to explore extensions to panoptic segmentation, which unifies instance and semantic segmentation.

For full details, see the original paper: CenterMask: Single Shot Instance Segmentation With Point Representation.
