How Alibaba’s Semantic Human Matting Achieves Fully Automatic High‑Precision Image Cutouts
This article introduces Alibaba’s intelligent matting editor and its Semantic Human Matting (SHM) algorithm, detailing the integration of semantic segmentation and deep matting networks, the fusion module, training strategy, experimental results, and the deployment of an online fully‑automatic cutout tool for designers.
Background
Image matting extracts a target foreground from an image with high precision and is widely used in image editing, mixed reality, creative composition, and film production. On large e‑commerce platforms such as Alibaba, high‑quality cutouts are essential for product display and advertising, yet traditional methods require complex, time‑consuming workflows.
Problem Statement
Conventional approaches either rely on semantic segmentation, which yields hard edges and cannot handle translucent regions (hair, wedding dresses, glass, smoke), or on image matting that needs a user‑provided trimap, making the process interactive and labor‑intensive. The goal is a fully automatic solution that delivers comparable or better quality without manual trimap creation.
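To make the term concrete: a trimap labels every pixel as definite foreground, definite background, or unknown, and the matting step only has to resolve the unknown band. The sketch below derives such a map from an existing alpha matte. It is an illustration only; the function name `trimap_from_alpha` and the thresholds `lo` and `hi` are assumptions, not values from the article or the SHM paper.

```python
import numpy as np

# Integer labels for the three trimap classes (an illustrative convention).
BACKGROUND, UNKNOWN, FOREGROUND = 0, 1, 2

def trimap_from_alpha(alpha, lo=0.05, hi=0.95):
    """Label each pixel as background, unknown, or foreground.

    `lo` and `hi` are hypothetical thresholds chosen for illustration.
    """
    alpha = np.asarray(alpha, dtype=np.float64)
    # Start with everything unknown, then mark the confident pixels.
    trimap = np.full(alpha.shape, UNKNOWN, dtype=np.uint8)
    trimap[alpha <= lo] = BACKGROUND
    trimap[alpha >= hi] = FOREGROUND
    return trimap

# Tiny 2x3 matte: left column is background, right column is foreground,
# the middle column is semi-transparent (hair, veil, glass).
alpha = np.array([[0.0, 0.3, 1.0],
                  [0.0, 0.6, 1.0]])
trimap = trimap_from_alpha(alpha)
```

In an interactive workflow a person draws this map by hand for every image, which is the step a fully automatic method has to replace.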
Semantic Human Matting (SHM) Pipeline
SHM combines a semantic segmentation module (T‑Net) and a deep matting module (M‑Net) through a differentiable fusion module, enabling end‑to‑end training.
T‑Net classifies every pixel into one of three classes (foreground, background, uncertain region), which amounts to predicting a trimap automatically, while M‑Net predicts detailed alpha values. The fusion module adaptively merges the semantic and fine‑detail information to produce the final alpha matte.
Key Contributions
According to the authors, SHM is the first fully deep‑learning‑based automatic human matting algorithm that jointly learns high‑level semantic cues and low‑level visual details.
A simple yet effective differentiable fusion module that lets T‑Net and M‑Net cooperate at the pixel level.
A large‑scale human matting dataset with 52,511 training and 1,400 testing images, which the authors describe as the largest in the matting field at the time of publication.
Fusion Module Details
Let F, B and U denote the raw (pre‑softmax) outputs of T‑Net for foreground, background and uncertain region. The foreground probability is the softmax over the three outputs:
p_f = \frac{e^{F}}{e^{F}+e^{B}+e^{U}}
Analogous formulas give the background probability p_b and the uncertain probability p_u, so that p_f + p_b + p_u = 1. M‑Net outputs an alpha estimate \alpha_M. The final alpha for a pixel is:
\alpha = p_f + p_u \cdot \alpha_M
In practice, when the uncertain probability p_u is high, the output relies more on M‑Net’s fine‑detail prediction; when p_u is low, \alpha is approximately p_f, so the semantic prediction dominates (close to 1 for confident foreground, close to 0 for confident background), ensuring coherent foreground‑background separation.
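The fusion step can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the fusion rule \alpha = p_f + p_u \cdot \alpha_M from the SHM paper; the function names `softmax3` and `fuse` and the toy logits are hypothetical, and the real module operates on network tensors inside the training graph.

```python
import numpy as np

def softmax3(f, b, u):
    """Per-pixel softmax over the three T-Net logit maps F, B, U."""
    logits = np.stack([f, b, u], axis=0)
    # Subtract the per-pixel maximum for numerical stability.
    logits = logits - logits.max(axis=0, keepdims=True)
    e = np.exp(logits)
    p = e / e.sum(axis=0, keepdims=True)
    return p[0], p[1], p[2]  # p_f, p_b, p_u

def fuse(f, b, u, alpha_m):
    """Combine T-Net probabilities with M-Net's alpha estimate."""
    p_f, _p_b, p_u = softmax3(f, b, u)
    return p_f + p_u * alpha_m

# Three pixels: confident foreground, confident background, uncertain.
f = np.array([10.0, -10.0, -10.0])
b = np.array([-10.0, 10.0, -10.0])
u = np.array([-10.0, -10.0, 10.0])
alpha_m = np.array([0.2, 0.9, 0.4])  # M-Net's raw estimates

alpha = fuse(f, b, u, alpha_m)
```

For the first two pixels M‑Net's estimate is effectively ignored and the result follows the semantic decision; only the third, uncertain pixel takes its value from \alpha_M. Every operation here is differentiable, which is what allows gradients to reach both networks during end‑to‑end training.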
Training Loss
Training proceeds in three stages: pre‑training T‑Net (semantic segmentation), pre‑training M‑Net (alpha prediction using the loss from Xu et al.), and end‑to‑end refinement of the whole network. The overall loss is:
L = L_{alpha} + \lambda_{trimap} L_{trimap}
where L_{alpha} combines the alpha prediction loss and the compositional loss, L_{trimap} is a cross‑entropy loss on the generated trimap, and \lambda_{trimap} = 0.01.
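The end‑to‑end objective can be sketched as follows. This is a hedged illustration rather than the authors' implementation: it assumes plain mean absolute differences for both terms of L_{alpha} and an equal weighting `gamma=0.5` between them, neither of which is stated in this article; all function names are hypothetical. Only the weight 0.01 on the trimap term comes from the text.

```python
import numpy as np

def alpha_loss(alpha_pred, alpha_gt, fg, bg, gamma=0.5):
    """Alpha prediction loss plus compositional loss.

    alpha_*: (H, W) mattes; fg, bg: (H, W, 3) colour images.
    `gamma` is an assumed weighting between the two terms.
    """
    a_p = alpha_pred[..., None]
    a_g = alpha_gt[..., None]
    # Composite with the predicted and the ground-truth matte.
    comp_pred = a_p * fg + (1.0 - a_p) * bg
    comp_gt = a_g * fg + (1.0 - a_g) * bg
    l_alpha = np.abs(alpha_pred - alpha_gt).mean()
    l_comp = np.abs(comp_pred - comp_gt).mean()
    return gamma * l_alpha + (1.0 - gamma) * l_comp

def trimap_cross_entropy(probs, labels, eps=1e-12):
    """Mean cross-entropy. probs: (3, H, W) softmax; labels: (H, W) ints."""
    h, w = labels.shape
    rows = np.arange(h)[:, None]
    cols = np.arange(w)[None, :]
    picked = probs[labels, rows, cols]  # probability of the true class
    return -np.log(np.clip(picked, eps, 1.0)).mean()

def total_loss(alpha_pred, alpha_gt, fg, bg, probs, labels, lam=0.01):
    """L = L_alpha + lambda_trimap * L_trimap, with lambda_trimap = 0.01."""
    return (alpha_loss(alpha_pred, alpha_gt, fg, bg)
            + lam * trimap_cross_entropy(probs, labels))

# Toy 1x2 image: white foreground over black background.
alpha_gt = np.array([[1.0, 0.0]])
alpha_pred = np.array([[0.5, 0.0]])
fg = np.ones((1, 2, 3))
bg = np.zeros((1, 2, 3))
labels = np.array([[2, 0]])          # hypothetical class ids
probs = np.full((3, 1, 2), 1.0 / 3)  # an uninformative T-Net
loss = total_loss(alpha_pred, alpha_gt, fg, bg, probs, labels)
```

The small weight keeps the trimap term acting as a regularizer: it stops T‑Net from drifting away from a sensible segmentation while the alpha term drives the refinement.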
Experiments
The authors evaluate on the dataset described above (52,511 training and 1,400 testing images) using the four standard matting metrics: sum of absolute differences (SAD), mean squared error (MSE), Gradient Error, and Connectivity Error. SHM is reported to outperform baseline segmentation‑plus‑matting pipelines on all four metrics.
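The two simpler metrics are easy to state in code. The sketch below computes SAD and MSE over the whole image; whether the reported numbers restrict the region or rescale the values is not stated in this article, so treat that as an assumption. The function names are illustrative, and the Gradient and Connectivity errors are omitted because their definitions involve more machinery.

```python
import numpy as np

def sad(alpha_pred, alpha_gt):
    """Sum of absolute differences between two alpha mattes."""
    return float(np.abs(alpha_pred - alpha_gt).sum())

def mse(alpha_pred, alpha_gt):
    """Mean squared error between two alpha mattes."""
    return float(((alpha_pred - alpha_gt) ** 2).mean())

# Toy 2x2 mattes that differ by 0.5 in two of the four pixels.
alpha_pred = np.array([[1.0, 0.5],
                       [0.0, 0.0]])
alpha_gt = np.array([[1.0, 1.0],
                     [0.0, 0.5]])

sad_value = sad(alpha_pred, alpha_gt)  # 0.5 + 0.5
mse_value = mse(alpha_pred, alpha_gt)  # (0.25 + 0.25) / 4
```

Lower is better for both. SAD grows with image size, so it is only comparable across methods evaluated on the same test images.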
Ablation studies confirm the importance of each component (T‑Net, M‑Net, Fusion Module). Visual comparisons show SHM achieving results comparable to state‑of‑the‑art interactive methods that require a user‑provided trimap.
Online Interactive Matting Editor
Based on the SHM algorithm, Alimama (Alibaba's marketing platform) launched an online matting editor. Users upload an image, and the backend automatically predicts the cutout. If the result is satisfactory, they can save it directly. Otherwise, interactive tools (select subject, erase background, fine‑tune) allow minimal manual adjustments.
The editor streamlines the workflow for designers, eliminating the need for specialized matting training while handling challenging translucent regions.
References
Quan Chen, Tiezheng Ge, Yanyu Xu, Zhiqiang Zhang, Xinxin Yang, Kun Gai. "Semantic Human Matting." ACM Multimedia 2018.
Jue Wang and Michael F. Cohen. "Image and video matting: a survey." Foundations and Trends in Computer Graphics and Vision, 2008.
Ning Xu et al. "Deep Image Matting." CVPR 2017.
Christoph Rhemann et al. "A perceptually motivated online benchmark for image matting." CVPR 2009.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.