How EcomXL Supercharges E‑commerce Image Generation with SDXL Optimizations and 3‑Second Inference
This article details how Alibaba's Wanxiang Lab adapted the SDXL diffusion model for large‑scale e‑commerce image generation, introducing the EcomXL series, a weighted‑distillation fine‑tuning method, hierarchical model fusion, specialized ControlNet variants, and the SLAM inference accelerator to achieve high‑quality, controllable product images within three seconds while boosting business metrics.
Background
With the rise of generative AI, Stable Diffusion combined with ControlNet has become popular in e‑commerce for creating product main images that directly affect click‑through rates. Alibaba's Wanxiang Lab aims to reduce merchants' time and cost by leveraging AIGC to produce high‑quality product visuals efficiently.
EcomXL Text‑to‑Image Model
Problem Definition
Although SDXL improves semantic understanding and visual appeal over SD1.5, e‑commerce demands more realistic human portraits, diverse commercial‑style backgrounds, and seamless product‑background integration. In addition, integrating control mechanisms (ControlNet/LoRA) raises compatibility challenges with the existing community ecosystem.
Model Optimization
Using both public and internal datasets, the team collected tens of millions of high‑quality human and background images. A two‑stage fine‑tuning pipeline was applied:
Stage 1: Full‑parameter fine‑tuning with a weighted‑distillation loss that combines a distillation term against the frozen original SDXL with the standard diffusion (denoising) loss on the new data, shifting the weighting over training to preserve semantic fidelity while improving visual details (a loss sketch follows this list).
Stage 2: Hierarchical model fusion, in which only the layers most influential for facial quality (identified via controlled experiments on the UNet) receive weighted blending with the original SDXL weights, keeping compatibility with the community ecosystem (a fusion sketch also follows).
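To make the Stage 1 objective concrete, here is a minimal PyTorch sketch of one way such a weighted‑distillation loss could be implemented. The function names, the `.sample` output convention (borrowed from diffusers‑style UNets), and the linear decay schedule are assumptions for illustration, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def weighted_distillation_loss(student_unet, teacher_unet, noisy_latents,
                               timesteps, cond, target_noise, step, total_steps):
    """Hypothetical sketch of a Stage-1 weighted-distillation objective:
    a distillation term keeps the student close to the frozen original SDXL,
    while a standard diffusion (noise-prediction) loss fits the new data;
    the relative weighting shifts over the course of training."""
    # Student prediction on the fine-tuning data.
    student_pred = student_unet(noisy_latents, timesteps,
                                encoder_hidden_states=cond).sample

    # Frozen teacher (original SDXL) prediction; no gradients needed.
    with torch.no_grad():
        teacher_pred = teacher_unet(noisy_latents, timesteps,
                                    encoder_hidden_states=cond).sample

    # Linear decay of the distillation weight over training (an assumption;
    # any monotone schedule would fit the description in the article).
    alpha = max(0.0, 1.0 - step / total_steps)

    distill_loss = F.mse_loss(student_pred, teacher_pred)    # stay close to SDXL
    diffusion_loss = F.mse_loss(student_pred, target_noise)  # fit e-commerce data
    return alpha * distill_loss + (1.0 - alpha) * diffusion_loss
```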
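And a sketch of the Stage 2 hierarchical fusion, again with assumed names: the layer‑selection patterns below are placeholders, whereas the actual face‑critical layers were identified experimentally by the authors.

```python
import torch

@torch.no_grad()
def fuse_unet_weights(finetuned_unet, base_unet, layer_weights):
    """Hypothetical sketch of hierarchical fusion: only the layers found to
    matter most for facial quality are blended back toward the original SDXL
    weights; every other layer keeps its fine-tuned parameters.

    `layer_weights` maps a name substring to a blend ratio w, where the fused
    parameter is w * finetuned + (1 - w) * base."""
    base_state = dict(base_unet.named_parameters())
    for name, param in finetuned_unet.named_parameters():
        for pattern, w in layer_weights.items():
            if pattern in name:
                param.copy_(w * param + (1.0 - w) * base_state[name])
                break
    return finetuned_unet

# Example (illustrative patterns only): pull the mid-block and a late up-block
# 60% of the way back toward the original SDXL weights.
# fused = fuse_unet_weights(ecomxl_unet, sdxl_unet,
#                           {"mid_block": 0.4, "up_blocks.2": 0.4})
```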
Comparison Results
EcomXL retains SDXL’s strong semantic capabilities while delivering noticeably better e‑commerce portraits and backgrounds, as shown in side‑by‑side visual comparisons.
EcomXL‑ControlNet
Beyond the base text‑to‑image model, the system integrates multiple ControlNet branches to preserve foreground fidelity, enrich backgrounds, and generate realistic body poses.
Inpainting ControlNet
A two‑phase training strategy is used: first on generic random masks, then on e‑commerce‑specific instance masks, enabling accurate background completion without distorting the product foreground.
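A minimal sketch of how this two‑phase mask curriculum might look in code; the rectangular‑mask heuristic and the function name are assumptions, with only the phase structure taken from the description above.

```python
import numpy as np

def sample_training_mask(image_hw, phase, instance_mask=None, rng=None):
    """Hypothetical two-phase mask sampler for the inpainting ControlNet:
    phase 1 uses generic random masks, phase 2 switches to e-commerce
    instance masks (the product foreground is kept, the background is
    regenerated). Mask value 1 marks the region to be inpainted."""
    rng = rng or np.random.default_rng()
    h, w = image_hw
    if phase == 1:
        # Generic random rectangular hole, as in standard inpainting training.
        mh, mw = rng.integers(h // 4, h // 2), rng.integers(w // 4, w // 2)
        top, left = rng.integers(0, h - mh), rng.integers(0, w - mw)
        mask = np.zeros((h, w), dtype=np.float32)
        mask[top:top + mh, left:left + mw] = 1.0
        return mask
    # Phase 2: everything except the product instance is inpainted, so the
    # model learns to complete backgrounds without touching the foreground.
    assert instance_mask is not None, "phase 2 needs a product instance mask"
    return 1.0 - instance_mask.astype(np.float32)
```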
Softedge ControlNet
Trained on tens of millions of highly aesthetic images, this branch enforces edge consistency for both product outlines and auxiliary elements, using a mixture of HED and PidiNet edge detectors together with their "safe" variants (a preprocessing sketch follows).
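For illustration, a small sketch of mixing edge detectors per training sample, assuming the controlnet_aux annotators (HEDdetector, PidiNetDetector) and their safe mode; the random per‑sample mixture is an assumption about how the detector variety might be implemented.

```python
import random
from controlnet_aux import HEDdetector, PidiNetDetector

# Load the standard annotator checkpoints once.
hed = HEDdetector.from_pretrained("lllyasviel/Annotators")
pidi = PidiNetDetector.from_pretrained("lllyasviel/Annotators")

def softedge_condition(image):
    """Pick one detector variant (HED / PidiNet, plain or 'safe') per sample
    so the softedge ControlNet does not overfit to a single edge style."""
    detector, safe = random.choice([
        (hed, False), (hed, True), (pidi, False), (pidi, True),
    ])
    return detector(image, safe=safe)
```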
SLAM: Sub‑Path Linear Approximation Model
To cut inference from 25 steps to 4 while preserving quality, the authors propose SLAM, which approximates short sub‑paths of the diffusion trajectory with linear segments and distills the model toward them. This reduces cumulative mapping error, yielding clearer textures at low step counts (a hedged sketch of the idea follows).
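The following is a loose sketch of the sub‑path idea under stated assumptions; it is not the paper's formulation, only an illustration of distilling a few‑step student toward linearly approximated points on a teacher sub‑path. All names (`teacher_step`, `student`, `lam`) are hypothetical.

```python
import torch
import torch.nn.functional as F

def slam_distillation_loss(student, teacher_step, x_t, t, t_prev, cond, lam):
    """Loose sketch: the frozen teacher integrates the diffusion ODE from t
    to t_prev; an intermediate point on that sub-path is approximated by
    linear interpolation between the two endpoints; the few-step student is
    trained to predict the endpoint from the interpolated point."""
    with torch.no_grad():
        x_prev = teacher_step(x_t, t, t_prev, cond)   # teacher sub-path endpoint
        x_mid = lam * x_prev + (1.0 - lam) * x_t      # linear approximation
        t_mid = lam * t_prev + (1.0 - lam) * t

    # The student's one-jump prediction from the approximated mid-point should
    # land on the sub-path endpoint, bounding error at low step counts.
    pred = student(x_mid, t_mid, cond)
    return F.mse_loss(pred, x_prev)
```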
Business Evaluation
Metrics
Visual usability: assesses portrait realism, body deformation, product edge quality, and background composition.
1‑vs‑1 win rate: blind pairwise comparison by designers.
Online adoption rate: proportion of generated images downloaded by users.
Online Results
Compared with the previous Ecom1.5 model, EcomXL improves visual usability (+5 pts), 1‑vs‑1 win rate (+2.8 pts), and adoption rate (+2 pts), leading to its deployment as the default model in Wanxiang Lab.
Conclusion
The study identifies three main shortcomings of vanilla SDXL for e‑commerce (portrait realism, background relevance, and inference latency) and addresses them through data‑driven fine‑tuning, weighted model fusion, specialized ControlNet branches, and the SLAM accelerator. The resulting EcomXL (huggingface‑ecomxl) and SLAM (huggingface‑slam) releases are now fully deployed, delivering near‑real‑time, high‑fidelity product images for commercial visual content.