How AI Diffusion Models Revolutionize E‑commerce Ad Image Creation
This article presents three innovations JD Advertising introduced in 2023: a relation‑aware diffusion model for poster layout, category‑aware background generation, and a planning‑and‑rendering pipeline. Together, these automatically produce high‑quality, scalable, and personalized e‑commerce ad posters, addressing the efficiency, cost, and creative limitations of manual design.
Introduction
E‑commerce advertising images need to capture attention, convey brand values, and build emotional connections, but traditional manual creation is inefficient and costly. Recent AIGC systems still fall short on conveying point‑of‑sale information, scaling across products, personalizing to users, and presenting content effectively. In 2023, JD Advertising proposed a series of innovations to address these challenges.
Poster Layout Generation with a Relation‑Aware Diffusion Model
Technical Background
Generating poster layouts involves predicting the positions and categories of visual elements, which is crucial for aesthetic appeal and information delivery. Manual design is time‑consuming and expensive, prompting research into automatic layout generation.
Early methods focused only on graphic relationships, ignoring visual content, while later content‑aware methods still missed two key factors: the role of text and the geometric relationships among elements.
The proposed relation‑aware diffusion model jointly considers visual‑textual and geometric relationships. By following a noise‑to‑layout paradigm, the model iteratively denoises sampled boxes, extracts RoI features via an image encoder, and employs a Visual‑Text Relation Awareness Module (VTRAM) and a Geometric Relation Awareness Module (GRAM) to incorporate both modalities.
Diffusion‑Based Layout Generation
The diffusion process adds Gaussian noise to a layout, while the denoising process gradually restores a coherent layout, enabling controllable generation through predefined layouts or text modifications.
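The forward (noising) half of this process has a standard closed form. Below is a minimal numpy sketch for normalized layout boxes, assuming the common linear beta schedule; the actual model's schedule and parameterization are not given in the article.

```python
import numpy as np

def forward_diffuse(boxes, t, betas, rng):
    """Closed-form forward step q(x_t | x_0): add Gaussian noise
    to normalized layout boxes at timestep t."""
    alpha_bar = np.cumprod(1.0 - betas)[t]        # cumulative signal retention
    noise = rng.standard_normal(boxes.shape)
    noisy = np.sqrt(alpha_bar) * boxes + np.sqrt(1.0 - alpha_bar) * noise
    return noisy, noise

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)             # linear schedule (assumption)
boxes = rng.uniform(size=(5, 4))                  # 5 boxes as (cx, cy, w, h) in [0, 1]
noisy, eps = forward_diffuse(boxes, 999, betas, rng)
```

At the final timestep almost all signal is destroyed, which is what lets the denoiser start from pure noise and iteratively recover a coherent layout.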
Visual‑Text Relation Awareness (VTRAM)
VTRAM aligns visual and textual features by concatenating positional embeddings with RoI features, then applying cross‑attention where visual features serve as queries and textual features as keys and values, producing multimodal fused features.
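The cross‑attention step described above can be sketched as follows. This is an illustrative simplification: the projection matrices, feature dimensions, and multi‑head details are assumptions, not the paper's exact architecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def vtram_cross_attention(roi_feats, pos_emb, text_feats, w_q, w_k, w_v):
    """RoI features concatenated with positional embeddings act as queries;
    text tokens supply keys and values; output is the fused multimodal feature."""
    queries = np.concatenate([roi_feats, pos_emb], axis=-1) @ w_q  # (N_box, d)
    keys = text_feats @ w_k                                        # (N_tok, d)
    values = text_feats @ w_v
    attn = softmax(queries @ keys.T / np.sqrt(queries.shape[-1]))
    return attn @ values                                           # (N_box, d)

rng = np.random.default_rng(1)
roi = rng.standard_normal((6, 32))      # 6 RoIs, 32-dim visual features
pos = rng.standard_normal((6, 8))       # positional embeddings
txt = rng.standard_normal((12, 16))     # 12 text tokens
w_q = rng.standard_normal((40, 16))
w_k = rng.standard_normal((16, 16))
w_v = rng.standard_normal((16, 16))
fused = vtram_cross_attention(roi, pos, txt, w_q, w_k, w_v)
```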
Geometric Relation Awareness (GRAM)
GRAM computes relative position features between RoIs, encodes them with sinusoidal embeddings, and normalizes geometric weights via softmax to enhance spatial understanding. Different element types receive distinct positioning strategies, and RoI features are projected to combine visual and categorical information.
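The geometric pipeline (relative positions, sinusoidal encoding, softmax normalization) can be sketched as below. The specific relative features (normalized offsets and log size ratios) and the embedding dimension are assumptions borrowed from common practice in relation networks, not values from the paper.

```python
import numpy as np

def relative_geometry(boxes):
    """Pairwise relative position features between boxes given as (cx, cy, w, h)."""
    cx, cy, w, h = boxes.T
    dx = (cx[None, :] - cx[:, None]) / w[:, None]   # offset normalized by width
    dy = (cy[None, :] - cy[:, None]) / h[:, None]   # offset normalized by height
    dw = np.log(w[None, :] / w[:, None])            # log size ratios
    dh = np.log(h[None, :] / h[:, None])
    return np.stack([dx, dy, dw, dh], axis=-1)      # (N, N, 4)

def sinusoidal_embed(x, dim):
    """Encode scalar geometry features with sinusoids of varying frequency."""
    freqs = 1.0 / (10000.0 ** (np.arange(dim // 2) / (dim // 2)))
    ang = x[..., None] * freqs
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)

def geometric_weights(boxes, w_g):
    """Softmax-normalized geometric attention weights over the other boxes."""
    emb = sinusoidal_embed(relative_geometry(boxes), 8)     # (N, N, 4, 8)
    logits = emb.reshape(len(boxes), len(boxes), -1) @ w_g  # (N, N)
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
boxes = rng.uniform(0.1, 0.9, size=(5, 4))  # positive widths/heights
weights = geometric_weights(boxes, rng.standard_normal(32))
```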
Category‑Common and Personalized Style Background Generation
Technical Background
Product advertising background generation aims to create realistic backgrounds for product cut‑out images, improving click‑through rates. Existing methods fall into text‑to‑image and image‑to‑image paradigms, each with limitations: the former demands careful prompt engineering, while the latter tends to lose layout details.
The proposed method generates backgrounds that inherit layout, composition, color, and style from a reference advertisement image, using a pre‑trained Stable Diffusion model, a Category‑Common Generator (CG), and a Personalized Generator (PG).
Category‑Common Generation
CG extracts product information from the cut‑out image and generates a generic background for the product’s category. It replaces the standard attention module with a mask‑aware attention that incorporates the product mask, enabling direct mapping from category names to style prompts.
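One plausible reading of mask‑aware attention is that the product mask biases the attention scores so that background synthesis does not draw on (and thus does not repaint) product pixels. The sketch below implements that reading in numpy; the exact way the paper injects the mask is not specified here, so treat this as an assumption.

```python
import numpy as np

def mask_aware_attention(q, k, v, product_mask):
    """Attention with scores biased by a flattened product mask: keys at
    product positions are suppressed so the background is generated without
    attending into the product region (illustrative interpretation)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(product_mask[None, :] > 0, -1e9, scores)  # block product keys
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn = e / e.sum(axis=-1, keepdims=True)
    return attn @ v, attn

rng = np.random.default_rng(3)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((10, 8))
v = rng.standard_normal((10, 8))
mask = np.zeros(10)
mask[:3] = 1  # first 3 spatial positions belong to the product
out, attn = mask_aware_attention(q, k, v, mask)
```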
Personalized Style Generation
PG overlays personalized information from a reference image onto the generic background without requiring textual prompts. PG’s output is filtered by the product mask to ensure style influences only the background region.
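The mask filtering step is simple alpha compositing: style only touches background pixels, and the product region is passed through untouched. A minimal sketch:

```python
import numpy as np

def restrict_style_to_background(stylized, base_with_product, product_mask):
    """Keep product pixels from the base image; apply the personalized
    style output only where the product mask is zero (background)."""
    m = product_mask[..., None]  # (H, W, 1), 1 on product pixels
    return m * base_with_product + (1.0 - m) * stylized

H, W = 4, 4
base = np.ones((H, W, 3))        # toy image with product pixels = 1
stylized = np.zeros((H, W, 3))   # toy stylized output = 0
mask = np.zeros((H, W))
mask[1:3, 1:3] = 1               # product occupies the center
out = restrict_style_to_background(stylized, base, mask)
```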
End‑to‑End Product Poster Generation via Planning and Rendering
Technical Background
High‑quality product posters require coherent element layout and harmonious backgrounds. Existing pipelines that simply combine image‑inpainting and layout generation suffer from background complexity and limited layout diversity.
The proposed solution mimics human designers by separating planning (layout prediction) and rendering (image synthesis).
Layout Generation with a Planning Network
PlanNet encodes product images and textual descriptions, then uses a Layout Decoder (two fully‑connected layers and N transformer blocks) to predict positions for the product and other visual elements.
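The final prediction head can be sketched as below: two fully connected layers mapping fused image/text features to normalized boxes. The transformer blocks and the encoders are omitted, and the dimensions and activations are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LayoutHead:
    """Two fully connected layers mapping fused features to a normalized
    box (cx, cy, w, h) per element token (transformer blocks omitted)."""

    def __init__(self, d_in, d_hidden, rng):
        self.w1 = rng.standard_normal((d_in, d_hidden)) * 0.02
        self.b1 = np.zeros(d_hidden)
        self.w2 = rng.standard_normal((d_hidden, 4)) * 0.02
        self.b2 = np.zeros(4)

    def __call__(self, feats):
        h = np.maximum(feats @ self.w1 + self.b1, 0.0)  # ReLU
        return sigmoid(h @ self.w2 + self.b2)           # boxes in (0, 1)

rng = np.random.default_rng(4)
head = LayoutHead(d_in=64, d_hidden=128, rng=rng)
boxes = head(rng.standard_normal((7, 64)))  # one box per visual element
```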
Background Generation with a Rendering Network
RenderNet receives the planned layout and product image, encodes layout masks, fuses spatial information via a Spatial Fusion Module, and feeds combined visual and layout features into ControlNet to guide Stable Diffusion, producing the final poster.
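Encoding the planned layout as masks amounts to rasterizing each predicted box into a per‑class binary channel, the kind of spatial conditioning a ControlNet branch consumes. A minimal sketch, with box format and channel layout assumed rather than taken from the paper:

```python
import numpy as np

def rasterize_layout(boxes, labels, n_classes, hw):
    """Turn normalized (cx, cy, w, h) boxes into per-class binary mask
    channels for spatial conditioning."""
    H, W = hw
    masks = np.zeros((n_classes, H, W))
    for (cx, cy, w, h), c in zip(boxes, labels):
        x0, x1 = int((cx - w / 2) * W), int((cx + w / 2) * W)
        y0, y1 = int((cy - h / 2) * H), int((cy + h / 2) * H)
        masks[c, max(y0, 0):y1, max(x0, 0):x1] = 1.0
    return masks

boxes = np.array([[0.5, 0.5, 0.5, 0.5],     # product, centered
                  [0.25, 0.25, 0.2, 0.2]])  # a text element, upper left
masks = rasterize_layout(boxes, labels=[0, 1], n_classes=3, hw=(64, 64))
```

These channels would then be fused with the encoded product image before being fed to ControlNet to guide Stable Diffusion.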
Conclusion and Outlook
Technical Summary
The presented solutions address the lack of point‑of‑sale information, scalability, and personalization in AIGC advertising images by (1) building a relation‑aware diffusion model for layout generation, (2) integrating category‑common and personalized style generators into diffusion models, and (3) proposing a planning‑and‑rendering framework (P&R) that jointly optimizes layout and background synthesis.
Future Directions
Future work will focus on improving controllability, enhancing multimodal integration of text, image, and video, and delivering personalized ad creatives tailored to specific user groups.
JD Cloud Developers
JD Cloud Developers is JD Technology Group's platform for technical sharing and communication among AI, cloud computing, IoT, and related developers. It publishes JD product technology updates, industry content, and tech event news. Embrace technology and partner with developers to envision the future.