How Alibaba’s Semantic Human Matting Achieves Fully Automatic High‑Precision Image Cutouts

This article introduces Alibaba’s intelligent matting editor and its Semantic Human Matting (SHM) algorithm, detailing the integration of semantic segmentation and deep matting networks, the fusion module, training strategy, experimental results, and the deployment of an online fully‑automatic cutout tool for designers.

Alibaba Cloud Developer

Background

Image matting extracts a target foreground from an image with high precision and is widely used in image editing, mixed reality, creative composition, and film production. On large e‑commerce platforms like Alibaba, high‑quality cutouts are essential for product display and advertising, yet traditional methods require complex, time‑consuming workflows.

Problem Statement

Conventional approaches either rely on semantic segmentation, which yields hard edges and cannot handle translucent regions (hair, wedding dresses, glass, smoke), or on image matting that needs a user‑provided trimap, making the process interactive and labor‑intensive. The goal is a fully automatic solution that delivers comparable or better quality without manual trimap creation.

Semantic Human Matting (SHM) Pipeline

SHM combines a semantic segmentation module (T‑Net) and a deep matting module (M‑Net) through a differentiable fusion module, enabling end‑to‑end training.

T‑Net predicts a three‑channel map (foreground, background, uncertain region) for every pixel, and M‑Net predicts detailed alpha values. The fusion module adaptively merges the semantic and fine‑detail information to produce the final alpha matte.

SHM network architecture

Key Contributions

According to the authors, SHM is the first automatic human matting algorithm built entirely on deep learning that jointly learns high‑level semantic cues and low‑level visual details.

A simple yet effective differentiable fusion module that lets T‑Net and M‑Net cooperate at the pixel level.

A large‑scale human matting dataset with 52,511 training and 1,400 testing images, the biggest in the matting field to date.

Fusion Module Details

Let F, B and U denote the raw (pre‑softmax) outputs of T‑Net at a pixel for the foreground, background and uncertain region. The foreground probability is:

p_f = \frac{e^{F}}{e^{F}+e^{B}+e^{U}}

The background probability p_b and the uncertain probability p_u are computed analogously, so p_f + p_b + p_u = 1. M‑Net outputs an alpha estimate \alpha_M. Treating confident foreground as alpha 1 and confident background as alpha 0, the final alpha for a pixel is:

\alpha = p_f \cdot 1 + p_u \cdot \alpha_M + p_b \cdot 0 = p_f + p_u \cdot \alpha_M

In practice, when the uncertain probability p_u is high, the output relies more on M‑Net’s fine‑detail prediction; when p_u is low, the semantic prediction dominates, ensuring coherent foreground‑background separation.
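
The fusion step can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the fused alpha is p_f + p_u \cdot \alpha_M (foreground contributes alpha 1, background contributes alpha 0), and the function name `fuse_alpha` and the channel order [F, B, U] are hypothetical choices made for the example.

```python
import numpy as np


def fuse_alpha(trimap_logits, alpha_m):
    """Fuse T-Net logits with the M-Net alpha (illustrative sketch).

    trimap_logits: array of shape (3, H, W), raw scores in the assumed order [F, B, U].
    alpha_m:       array of shape (H, W), M-Net alpha estimate in [0, 1].
    """
    # Numerically stable softmax over the channel axis.
    shifted = trimap_logits - trimap_logits.max(axis=0, keepdims=True)
    exp = np.exp(shifted)
    p_f, p_b, p_u = exp / exp.sum(axis=0, keepdims=True)
    # Foreground counts as alpha 1, background as alpha 0 (so p_b drops out),
    # and the uncertain region defers to M-Net.
    return p_f + p_u * alpha_m


# Three pixels: confident foreground, confident background, mostly uncertain.
logits = np.array([
    [[10.0, -10.0, 0.0]],   # F scores
    [[-10.0, 10.0, 0.0]],   # B scores
    [[-10.0, -10.0, 5.0]],  # U scores
])
alpha_m = np.array([[0.3, 0.3, 0.3]])
alpha = fuse_alpha(logits, alpha_m)
```

In this toy input the first pixel ends up close to 1, the second close to 0, and the third close to the M‑Net value of 0.3, which matches the behavior described above.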

Training Loss

Training proceeds in three stages: pre‑training T‑Net (semantic segmentation), pre‑training M‑Net (alpha prediction using the loss from Xu et al.), and end‑to‑end refinement of the whole network. The overall loss is:

L = L_{alpha} + \lambda_{trimap} L_{trimap}

where L_{alpha} combines the alpha prediction loss and the compositional loss, and L_{trimap} is a cross‑entropy loss on the generated trimap, weighted by \lambda_{trimap} = 0.01.
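
The loss above can be written out as a small NumPy sketch. Several details here are assumptions rather than facts from the article: the equal weighting `gamma=0.5` between the alpha prediction and compositional terms, the smooth absolute difference (in the style of Xu et al.), and all function names. Only the structure L_{alpha} + \lambda L_{trimap} with \lambda = 0.01 comes from the text.

```python
import numpy as np

EPS = 1e-6  # keeps the smooth absolute value differentiable at 0


def smooth_abs(x):
    """Smooth stand-in for |x| (assumed form)."""
    return np.sqrt(x * x + EPS * EPS)


def alpha_loss(alpha_pred, alpha_gt, fg, bg, image, gamma=0.5):
    """L_alpha: weighted sum of alpha prediction and compositional losses.

    alpha_*: (H, W); fg, bg, image: (H, W, 3). gamma is a hypothetical weight.
    """
    l_pred = smooth_abs(alpha_pred - alpha_gt).mean()
    a = alpha_pred[..., None]
    composite = a * fg + (1.0 - a) * bg  # re-composite with the predicted alpha
    l_comp = smooth_abs(composite - image).mean()
    return gamma * l_pred + (1.0 - gamma) * l_comp


def trimap_loss(probs, labels):
    """L_trimap: mean cross-entropy. probs: (3, H, W); labels: (H, W) ints in {0, 1, 2}."""
    picked = np.take_along_axis(probs, labels[None], axis=0)[0]
    return -np.log(picked + 1e-12).mean()


def total_loss(alpha_pred, alpha_gt, fg, bg, image, probs, labels, lam=0.01):
    """L = L_alpha + lam * L_trimap, with lam = 0.01 as stated in the text."""
    return alpha_loss(alpha_pred, alpha_gt, fg, bg, image) + lam * trimap_loss(probs, labels)


# Toy 2x2 example: a perfect alpha prediction with an uninformative trimap.
alpha_gt = np.array([[0.0, 0.5], [1.0, 0.25]])
fg = np.full((2, 2, 3), 0.8)
bg = np.full((2, 2, 3), 0.1)
image = alpha_gt[..., None] * fg + (1.0 - alpha_gt[..., None]) * bg
probs = np.full((3, 2, 2), 1.0 / 3.0)
labels = np.array([[1, 2], [0, 2]])
loss = total_loss(alpha_gt, alpha_gt, fg, bg, image, probs, labels)
```

With a perfect alpha, the remaining loss comes almost entirely from the trimap term, scaled down by the small weight so that it regularizes T‑Net without dominating the alpha objective.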

Experiments

The authors built a matting dataset of 52,511 training and 1,400 testing images, which they report as the largest at the time, and evaluated with four standard metrics: sum of absolute differences (SAD), mean squared error (MSE), Gradient error, and Connectivity error. SHM outperforms the baseline segmentation‑plus‑matting pipelines on all four metrics.
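
For reference, the two simplest metrics can be computed as follows. This is a sketch under assumptions: both metrics are taken over every pixel of the image, and no rescaling is applied (papers often report SAD divided by 1000). The Gradient and Connectivity errors need more machinery and are omitted here.

```python
import numpy as np


def sad(alpha_pred, alpha_gt):
    """Sum of absolute differences between predicted and ground-truth alpha."""
    return np.abs(alpha_pred - alpha_gt).sum()


def mse(alpha_pred, alpha_gt):
    """Mean squared error between predicted and ground-truth alpha."""
    return np.mean((alpha_pred - alpha_gt) ** 2)


# Toy 2x2 mattes; two of the four pixels are off.
pred = np.array([[0.0, 0.5], [1.0, 0.8]])
gt = np.array([[0.0, 1.0], [1.0, 1.0]])
sad_value = sad(pred, gt)  # 0.5 + 0.2
mse_value = mse(pred, gt)  # (0.25 + 0.04) / 4
```

Lower is better for both metrics.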

Quantitative results

Ablation studies confirm the importance of each component (T‑Net, M‑Net, Fusion Module). Visual comparisons show SHM achieving results comparable to state‑of‑the‑art interactive methods that require a user‑provided trimap.

Ablation study

Online Interactive Matting Editor

Based on the SHM algorithm, Alimama (Alibaba's marketing technology arm) launched an online matting editor. Users upload an image, and the backend automatically predicts the cutout. If the result is satisfactory, they can save it directly. Otherwise, interactive tools (select subject, erase background, fine‑tune) let them make minimal manual adjustments.

Upload interface
Automatic result

The editor streamlines the workflow for designers, eliminating the need for specialized matting training while handling challenging translucent regions.

References

Quan Chen, Tiezheng Ge, Yanyu Xu, Zhiqiang Zhang, Xinxin Yang, Kun Gai. "Semantic Human Matting." ACM Multimedia 2018.

Jue Wang and Michael F. Cohen. "Image and Video Matting: A Survey." Foundations and Trends in Computer Graphics and Vision, 2008.

Ning Xu et al. "Deep Image Matting." CVPR 2017.

Christoph Rhemann et al. "A perceptually motivated online benchmark for image matting." CVPR 2009.
