
Mask‑Guided Diffusion for Precise Product Image Generation

Mask‑Guided Diffusion combines instance‑mask training, Masked Canny ControlNet, and Mask‑guided Attribute Binding to preserve product details, correctly bind attributes, fix hand distortion, and generate uniform colored backgrounds, enabling merchants to quickly create high‑quality, controllable product images with Stable Diffusion.

Alimama Tech

With the rapid advancement of AIGC technologies such as Stable Diffusion, it has become feasible to generate product images directly from textual prompts. This motivates a system that can automatically replace product backgrounds and adjust model appearances.

The authors built an AI creative production tool, Wanxiang Lab, which integrates Stable Diffusion with control models (e.g., ControlNet) to let merchants generate diverse background scenes for a single product within minutes.

Key challenges identified include inaccurate preservation of product features, trade‑offs between foreground detail and background blur, attribute‑binding failures, hand distortion, and difficulty generating uniform‑color backgrounds.

To address product/element control, two methods are proposed: (1) instance‑mask training, where high‑quality Taobao product images are segmented to create instance masks for inpainting model training, reducing over‑completion; (2) Masked Canny ControlNet inference, a training‑free strategy that expands a foreground mask, multiplies it with ControlNet output, and feeds the result to the U‑Net decoder, thereby preserving product edges while avoiding background interference.
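The masked‑inference step above can be sketched as follows. This is a minimal NumPy illustration of the idea, not the production implementation: the function name, the cross‑shaped dilation, and the dilation radius are all assumptions made for clarity.

```python
import numpy as np

def expand_mask(mask, iterations):
    """Dilate a boolean mask with a cross-shaped structuring element."""
    m = mask.copy()
    for _ in range(iterations):
        p = np.pad(m, 1)
        m = (p[1:-1, 1:-1] | p[:-2, 1:-1] | p[2:, 1:-1]
             | p[1:-1, :-2] | p[1:-1, 2:])
    return m

def masked_controlnet_residual(control_residual, fg_mask, expand_px=8):
    """Zero out ControlNet guidance outside an expanded foreground mask.

    control_residual: (H, W, C) feature residual produced by ControlNet
    from the Canny condition.
    fg_mask: (H, W) boolean product (foreground) mask.
    expand_px: hypothetical dilation radius so that edge guidance right
    at the product boundary is preserved.
    """
    expanded = expand_mask(fg_mask, expand_px)
    # Multiply: edge guidance applies only inside the expanded foreground,
    # so the background remains free of Canny interference.
    return control_residual * expanded[..., None]
```

The masked residual is then added to the U‑Net decoder features, which is how the training‑free strategy keeps product edges sharp while leaving the background unconstrained.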

For model attribute control, the paper introduces Mask‑guided Attribute Binding (MGAB). By extracting object masks from the prompt, a language‑guided loss aligns the attention maps of attribute and object tokens, ensuring that specified attributes (e.g., color) correctly follow the intended objects under visual control conditions.
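The binding objective can be illustrated with a toy loss. The exact form used by MGAB is not given in the summary, so the two terms below (attention mass leaking outside the object mask, plus attribute/object attention disagreement inside it) are an assumed stand‑in for the language‑guided alignment loss.

```python
import numpy as np

def attribute_binding_loss(attn_attr, attn_obj, obj_mask):
    """Hypothetical MGAB-style loss on cross-attention maps.

    attn_attr, attn_obj: (H, W) attention maps for the attribute and
    object tokens (e.g. "red" and "dress"), each normalized to sum to 1.
    obj_mask: (H, W) binary mask of the object extracted for the prompt.
    """
    # Attribute attention mass falling outside the object region:
    leak = (attn_attr * (1 - obj_mask)).sum()
    # Disagreement between attribute and object attention inside the mask:
    mismatch = np.abs(attn_attr - attn_obj)[obj_mask > 0].sum()
    return leak + mismatch
```

Minimizing such a loss during inference pulls the attribute token's attention onto the intended object's mask, which is how a color like "red" is kept from drifting onto the wrong garment.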

Hand distortion is mitigated by reconstructing a 3‑D hand model from the distorted image, rendering depth and Canny maps from it, and using ControlNet to locally repaint the hand region, dramatically improving hand realism.
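Because the repaint is local, the final step is compositing the repaired hand back into the original image. A minimal sketch of that composite, with an optional feathered mask edge (the `feather` parameter and box‑blur softening are assumptions, not the article's stated method):

```python
import numpy as np

def composite_local_repaint(original, repainted, hand_mask, feather=0):
    """Blend the repainted hand region into the original image.

    original, repainted: (H, W, 3) float images; repainted is the
    ControlNet output guided by the rendered depth/Canny maps.
    hand_mask: (H, W) binary mask of the hand region to replace.
    feather: optional radius for softening the mask edge (box blur).
    """
    m = hand_mask.astype(float)
    if feather:
        k = 2 * feather + 1
        pad = np.pad(m, feather, mode="edge")
        m = sum(pad[i:i + m.shape[0], j:j + m.shape[1]]
                for i in range(k) for j in range(k)) / (k * k)
    m = m[..., None]
    # Only the masked hand pixels change; the rest of the image is kept.
    return repainted * m + original * (1 - m)
```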

Pure‑color background generation combines Shuffle ControlNet with a local mask and a LoRA fine‑tuned on high‑quality white‑background images. A post‑processing color‑matcher then maps the white background to any target color, achieving stable, uniform backgrounds.
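The post‑processing color‑matcher step can be sketched as a simple recoloring pass. The brightness threshold and the shading‑preserving scaling below are illustrative assumptions; the article does not specify how its color‑matcher is implemented.

```python
import numpy as np

def recolor_white_background(img, target_rgb, white_thresh=240):
    """Map near-white background pixels to a target background color.

    img: (H, W, 3) uint8 image with a white background (e.g. produced
    with the white-background LoRA); target_rgb: desired color.
    Shading is preserved by scaling the target with pixel brightness.
    """
    img = img.astype(float)
    brightness = img.mean(axis=-1)
    bg = brightness >= white_thresh            # assume near-white = background
    shade = (brightness / 255.0)[..., None]    # keep soft shadows/gradients
    target = np.asarray(target_rgb, dtype=float)
    out = img.copy()
    out[bg] = (shade * target)[bg]             # foreground pixels untouched
    return out.astype(np.uint8)
```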

The system has been deployed in Wanxiang Lab, serving many merchants. Future work includes accelerating diffusion inference, improving foreground‑background lighting fusion, and further enhancing control precision.

Tags: computer vision, AI, image generation, diffusion, ControlNet, mask guidance
Written by Alimama Tech, the official Alimama tech channel showcasing Alimama's technical innovations.