Industry Insights · 14 min read

How EcomXL Supercharges E‑commerce Image Generation with SDXL Optimizations and 3‑Second Inference

This article details how Alibaba's Wanxiang Lab adapted the SDXL diffusion model for large‑scale e‑commerce image generation. The resulting EcomXL series combines a weighted‑distillation fine‑tuning method, hierarchical model fusion, specialized ControlNet variants, and the SLAM inference accelerator to produce high‑quality, controllable product images within three seconds, while lifting key business metrics.

NewBeeNLP

Background

With the rise of generative AI, Stable Diffusion combined with ControlNet has become popular in e‑commerce for creating product main images that directly affect click‑through rates. Alibaba's Wanxiang Lab aims to reduce merchants' time and cost by leveraging AIGC to produce high‑quality product visuals efficiently.

EcomXL Text‑to‑Image Model

Problem Definition

Although SDXL improves semantic understanding and visual appeal over SD1.5, e‑commerce demands more realistic human portraits, diverse commercial‑style backgrounds, and seamless product‑background integration. Additionally, integrating control mechanisms such as ControlNet and LoRA raises compatibility challenges with the existing community ecosystem.

Model Optimization

Using both public and internal datasets, the team collected tens of millions of high‑quality human and background images. A two‑stage fine‑tuning pipeline was applied:

Stage 1: Full‑parameter fine‑tuning with a weighted‑distillation loss that emphasizes denoising loss early and gradually shifts to the original diffusion loss, preserving semantic fidelity while improving visual details.
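The time‑varying weighting in Stage 1 can be sketched as follows. The article does not give the exact schedule, so the linear decay below is a hypothetical illustration of the idea: the distillation (denoising) term dominates early and weight gradually shifts to the original diffusion loss.

```python
def loss_weight(step: int, total_steps: int) -> float:
    """Hypothetical linear decay of the distillation weight from 1.0 to 0.0."""
    return max(0.0, 1.0 - step / total_steps)

def weighted_loss(distill_loss: float, diffusion_loss: float,
                  step: int, total_steps: int) -> float:
    """Blend the two terms; early steps emphasize distillation."""
    w = loss_weight(step, total_steps)
    return w * distill_loss + (1.0 - w) * diffusion_loss
```

Any monotone decay (cosine, exponential) would serve the same purpose; the key property is the smooth handoff between the two objectives.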

Stage 2: Hierarchical model fusion where only layers most influential for facial quality (identified via controlled experiments on the UNet) receive weighted blending with the original SDXL weights, keeping community compatibility.
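The fusion step amounts to a per‑layer weighted average of state dicts. A minimal sketch, assuming a whitelist of face‑critical layer names (the actual layer set and blend ratio come from the team's controlled experiments and are not published):

```python
def fuse_state_dicts(finetuned: dict, original: dict,
                     face_layers: set, alpha: float = 0.5) -> dict:
    """Blend only whitelisted layers: alpha * finetuned + (1 - alpha) * original.

    All other layers keep their fine-tuned values, which preserves
    compatibility with community checkpoints built on the original weights.
    """
    fused = {}
    for name, w in finetuned.items():
        if name in face_layers:
            fused[name] = alpha * w + (1.0 - alpha) * original[name]
        else:
            fused[name] = w
    return fused
```

With real checkpoints the values would be tensors rather than floats, but the blending arithmetic is identical.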

Comparison Results

EcomXL retains SDXL’s strong semantic capabilities while delivering noticeably better e‑commerce portraits and backgrounds, as shown in side‑by‑side visual comparisons.

EcomXL‑ControlNet

Beyond the base text‑to‑image model, the system integrates multiple ControlNet branches to preserve foreground fidelity, enrich backgrounds, and generate realistic body poses.

Inpainting ControlNet

A two‑phase training strategy is used: first on generic random masks, then on e‑commerce‑specific instance masks, enabling accurate background completion without distorting the product foreground.
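The mask curriculum above can be sketched as a step‑dependent mask sampler. The phase boundary and these toy mask generators are illustrative, not the team's implementation; the point is that phase 2 only ever masks the background, so the product foreground is never inpainted.

```python
import random

def random_box_mask(h: int, w: int, rng: random.Random) -> set:
    """Phase 1: a generic random rectangular mask, as a set of (row, col) pixels."""
    top, left = rng.randrange(h), rng.randrange(w)
    bottom, right = rng.randrange(top, h), rng.randrange(left, w)
    return {(r, c) for r in range(top, bottom + 1)
                   for c in range(left, right + 1)}

def training_mask(step: int, phase1_steps: int, instance_mask: set,
                  h: int, w: int, rng: random.Random) -> set:
    """Switch from generic masks to instance-aware background masks."""
    if step < phase1_steps:
        return random_box_mask(h, w, rng)
    # Phase 2: mask everything except the product instance (background only).
    return {(r, c) for r in range(h) for c in range(w)} - instance_mask
```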

Softedge ControlNet

Trained on tens of millions of high‑aesthetic‑score images, this branch enforces edge consistency for both product outlines and auxiliary elements, using a mixture of edge detectors (HED, PiDiNet, and their safe variants).
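Mixing detectors is typically done by sampling one at random per training example, so the ControlNet does not overfit to any single detector's edge style. A minimal sketch with toy stand‑in "detectors" (the real ones are neural edge models):

```python
import random

def hed_edges(image):
    """Stand-in for an HED edge detector: coarse threshold."""
    return [[1 if px > 0.5 else 0 for px in row] for row in image]

def pidi_edges(image):
    """Stand-in for a PiDiNet edge detector: finer threshold."""
    return [[1 if px > 0.3 else 0 for px in row] for row in image]

EDGE_DETECTORS = [hed_edges, pidi_edges]

def edge_condition(image, rng: random.Random):
    """Pick one detector at random to produce this sample's conditioning map."""
    return rng.choice(EDGE_DETECTORS)(image)
```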

SLAM: Sub‑Path Linear Approximation Model

To cut inference steps from 25 to 4 while preserving quality, the authors propose SLAM, which linearly interpolates between adjacent diffusion timesteps and adds a small distillation loss. This reduces cumulative mapping error, yielding clearer textures at low step counts.
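A heavily simplified sketch of the sub‑path idea (all numerics are toy values, not the paper's formulation): within a short segment of the denoising trajectory, the curved path from one latent to the next is approximated by a straight line, and a distillation loss pulls the student's one‑step jump toward that linear target.

```python
def linear_subpath_target(x_t: float, x_s: float, tau: float) -> float:
    """Point a fraction tau along the straight line from x_t to x_s."""
    return x_t + tau * (x_s - x_t)

def distillation_loss(student_pred: float, x_t: float, x_s: float,
                      tau: float) -> float:
    """Squared error between the student's jump and the linear target."""
    target = linear_subpath_target(x_t, x_s, tau)
    return (student_pred - target) ** 2
```

Because each sub‑path is short, the linear approximation stays close to the true trajectory, which is why the cumulative mapping error shrinks even at 4 steps.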

Business Evaluation

Metrics

Visual usability: assesses portrait realism, body deformation, product edge quality, and background composition.

1‑vs‑1 win rate: blind pairwise comparison by designers.

Online adoption rate: proportion of generated images downloaded by users.
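All three metrics are simple proportions over an evaluation set. A toy illustration (the sample counts below are invented, not the paper's data):

```python
def rate(hits: int, total: int) -> float:
    """Percentage of hits out of total evaluated items."""
    return 100.0 * hits / total

# Invented counts for illustration only.
usability = rate(85, 100)   # images judged visually usable by reviewers
win_rate  = rate(56, 100)   # blind pairwise comparisons won vs. baseline
adoption  = rate(12, 100)   # generated images actually downloaded by users
```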

Online Results

Compared with the previous Ecom1.5 model, EcomXL improves visual usability (+5 pts), 1‑vs‑1 win rate (+2.8 pts), and adoption rate (+2 pts), leading to its deployment as the default model in Wanxiang Lab.

Conclusion

The study identifies three main shortcomings of vanilla SDXL for e‑commerce—portrait realism, background relevance, and inference latency—and addresses them through data‑driven fine‑tuning, weighted model fusion, specialized ControlNet branches, and the SLAM accelerator. The resulting EcomXL (huggingface‑ecomxl) and SLAM (huggingface‑slam) solutions are now fully deployed, delivering near‑real‑time, high‑fidelity product images and setting a new benchmark for AIGC in commercial visual content.

Tags: Inference Acceleration · Image Generation · AIGC · ControlNet · EcomXL · SDXL · E‑commerce