How AIGC Is Transforming E‑commerce with Personalized Visual Content

This article explains how large‑model AIGC technology reshapes e‑commerce by enabling mass‑produced, user‑profile‑driven visual assets, detailing the evolution from early online trade to the 2.0 era, the technical pipeline of multimodal models, and the practical impact on merchants.

JD Retail Technology

Amid the wave of AIGC technologies across industries, visual generation is becoming a core force reshaping the e‑commerce ecosystem. As e‑commerce shifts from simple product listing to content‑driven experiences, brands urgently need massive, diverse, and precise visual assets, a demand that traditional manual creation cannot meet.

Large‑model AIGC offers a breakthrough: it can batch‑generate product images and live‑stream videos, and it can tailor assets to each user's profile, shifting content delivery from mass broadcasting to precise targeting. Reported cost reductions of up to 90% and conversion gains of 30% illustrate its industrial impact.

The speaker, Jason, head of JD Retail Visual and AIGC, breaks down the technical framework for “thousand‑people‑thousand‑faces” personalized material generation (a distinct look for every shopper), covering its two core models, how the capability is put to work for merchants, and planned upgrades.

E‑commerce evolved from offline trade into the 1.0 era of search‑based personalization in the 1990s and 2000s. It entered the 2.0 era in 2022 with the rise of ChatGPT, Midjourney, and multimodal AI, which promise smarter demand matching, more efficient logistics, 24/7 AI service, and immersive virtual shopping.

“Thousand‑people‑thousand‑faces” product material means generating distinct visual assets for different buyer personas, such as outdoor‑function seekers, aesthetic‑focused shoppers, and price‑sensitive consumers, based on multimodal understanding of both product and user data.

The technical pipeline consists of four key components: a multimodal large model that ingests product metadata and user profiles and writes generation instructions, a generation model that creates multiple visual assets from those instructions, a quality‑estimation model that filters out low‑quality outputs, and a distribution system that serves the remaining assets. Feedback from live traffic iteratively refines all of the models.
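
The flow described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Asset` class, `run_pipeline`, the `min_quality` threshold, and the three `demo_*` stand-ins for the real models are all hypothetical names, not part of JD's system.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Asset:
    persona: str    # user group the asset was generated for
    content: str    # placeholder for the generated image or video
    quality: float  # score assigned by the quality-estimation step


def run_pipeline(
    product: dict,
    personas: List[str],
    write_instruction: Callable[[dict, str], str],
    generate: Callable[[str], List[str]],
    estimate_quality: Callable[[str], float],
    min_quality: float = 0.5,  # hypothetical cut-off
) -> List[Asset]:
    """Understand -> generate -> filter; the result is handed to distribution."""
    kept = []
    for persona in personas:
        # Step 1: the understanding model turns product + persona into an instruction.
        instruction = write_instruction(product, persona)
        # Step 2: the generation model produces several candidates per instruction.
        for candidate in generate(instruction):
            # Step 3: the quality model drops weak candidates.
            score = estimate_quality(candidate)
            if score >= min_quality:
                kept.append(Asset(persona, candidate, score))
    return kept


# Toy stand-ins for the three models (assumptions for the demo only).
def demo_instruction(product: dict, persona: str) -> str:
    return f"{product['name']} for {persona}"


def demo_generate(instruction: str) -> List[str]:
    return [f"{instruction} #1", f"{instruction} #2"]


def demo_quality(candidate: str) -> float:
    return 0.9 if candidate.endswith("#1") else 0.2


assets = run_pipeline(
    {"name": "coffee"},
    ["office workers", "fitness enthusiasts"],
    demo_instruction,
    demo_generate,
    demo_quality,
)
for asset in assets:
    print(asset)
```

The feedback loop from live traffic is left out of the sketch; in this framing it would adjust the three callables over time.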

Because running inference for every user in real time is infeasible, the framework is relaxed in practice to “thousand‑people‑hundred‑faces” or “thousand‑people‑ten‑faces,” generating assets only for the top K user groups of each product.
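
One way to picture the top K restriction is as an offline selection step plus a serving-time lookup with a fallback. The sketch below is an assumption-laden illustration: the affinity scores, the `default` fallback asset, and the function names are invented for the example, and the group names are borrowed from the article's coffee case.

```python
from typing import Dict, List

DEFAULT_GROUP = "default"  # hypothetical key for the generic, non-personalized asset


def top_k_groups(group_scores: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-scoring user groups for one product."""
    ranked = sorted(group_scores.items(), key=lambda item: item[1], reverse=True)
    return [group for group, _ in ranked[:k]]


def pick_asset(user_group: str, assets_by_group: Dict[str, str]) -> str:
    """Serve the pre-generated asset for the user's group, else the generic one."""
    return assets_by_group.get(user_group, assets_by_group[DEFAULT_GROUP])


# Made-up product-to-group affinity scores.
scores = {
    "fitness enthusiasts": 0.9,
    "office workers": 0.8,
    "exam students": 0.5,
    "low-sugar dieters": 0.4,
    "outdoor lovers": 0.3,
}

selected = top_k_groups(scores, k=3)

# Assets are generated offline for the selected groups only.
assets_by_group = {group: f"scene for {group}" for group in selected}
assets_by_group[DEFAULT_GROUP] = "generic product shot"

print(selected)
print(pick_asset("office workers", assets_by_group))
print(pick_asset("outdoor lovers", assets_by_group))
```

Users outside the selected groups still see a valid asset, so the cost of generation scales with K rather than with the number of users.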

A case study on a JD‑Zao coffee product shows how the multimodal model identifies target groups such as fitness enthusiasts, office workers, exam students, low‑sugar dieters, and outdoor lovers, then generates tailored visual scenes via a controllable diffusion model.

The multimodal understanding model follows a Vision‑Language architecture with tokenizers for each modality feeding a Mixture‑of‑Experts decoder‑only LLM, trained on both generic and retail‑specific tasks to retain broad capabilities while specializing in e‑commerce reasoning.
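
The article does not say how the Mixture‑of‑Experts layers route tokens. As a rough illustration of the general idea only, the sketch below uses top-k softmax gating, a common MoE scheme assumed here rather than taken from the source; the experts are toy scalar functions and the gate logits are passed in directly instead of coming from a learned router.

```python
import math
from typing import Callable, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def moe_forward(
    x: float,
    gate_logits: Sequence[float],
    experts: Sequence[Callable[[float], float]],
    k: int = 2,
) -> float:
    """Route x to the k experts with the highest gate probability."""
    probs = softmax(gate_logits)
    # Indices of the k most probable experts; the rest are never evaluated.
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalize the selected probabilities so the mixing weights sum to 1.
    norm = sum(probs[i] for i in top)
    return sum(probs[i] / norm * experts[i](x) for i in top)


# Three toy experts; the third gets a very low gate score and is skipped.
experts = [lambda v: v + 1, lambda v: 2 * v, lambda v: -v]
output = moe_forward(3.0, gate_logits=[0.0, 0.0, -10.0], experts=experts, k=2)
print(output)
```

Only k experts run per input, which is why such models can hold many parameters while keeping per-token compute bounded.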

Post‑training uses a reinforcement‑learning pipeline (Follow‑GRPO) that samples a group of answers for each question and scores each answer on logical consistency, clarity, semantic similarity, and format compliance.
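
The group-based scoring can be illustrated with the group-relative advantage used in standard GRPO, where each answer's reward is normalized against the other answers sampled for the same question. This is a sketch of that general technique, not of Follow‑GRPO specifically: the weights, the per-criterion scores, and the function names are assumptions, since the article does not say how the four criteria are combined.

```python
from statistics import mean, pstdev
from typing import Dict, List

# Hypothetical weights for the four criteria named in the article.
WEIGHTS = {"logic": 0.4, "clarity": 0.2, "similarity": 0.2, "format": 0.2}


def combined_reward(scores: Dict[str, float]) -> float:
    """Weighted sum of the per-criterion scores for one answer."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own group: (r - mean) / std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    return [(reward - mu) / (sigma + eps) for reward in rewards]


# Made-up scores for three answers sampled for the same question.
group = [
    {"logic": 1.0, "clarity": 0.8, "similarity": 0.9, "format": 1.0},
    {"logic": 0.5, "clarity": 0.6, "similarity": 0.7, "format": 1.0},
    {"logic": 0.2, "clarity": 0.4, "similarity": 0.3, "format": 0.0},
]

rewards = [combined_reward(answer) for answer in group]
advantages = group_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.2f} advantage={advantage:+.2f}")
```

Answers that beat their group's average get a positive advantage and are reinforced, while below-average answers are pushed down, without needing a separate value model.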

Evaluation of the OxygenVLM model shows comparable performance on open‑source benchmarks while achieving significant gains on retail‑specific tasks.

The controllable visual generation model is a multi‑condition diffusion system where product, text, layout, and patches act as controllable inputs. Although current text‑prompt following is limited, future work aims to unify all conditions into natural language.

Since 2023, the visual generation stack has evolved from Stable Diffusion + ControlNet to DiT + Redux and now to VAE‑based context integration, moving toward a unified model that merges understanding and generation.

The upgraded JD “JingDianDian” platform, now called OxygenVision, adds a conversational UI, autonomous task planning by large models, algorithmic consistency guarantees, and seamless integration with JD’s A/B testing system.

The four major upgrades are: 1) natural‑language driven image creation, 2) model‑guided task decomposition and execution, 3) diversified visual output while preserving product consistency, and 4) cross‑border, multi‑language support for external merchants.

Future capabilities will add bulk material generation for SKU lists, short‑form (5 s) and long‑form (30 s) video creation, goal‑oriented generation tuned to click‑through or conversion targets, and expanded multilingual support for global merchants.

For a hands‑on experience, visit the refreshed JingDianDian platform at oxygen-vision.jd.com.

Tags: e-commerce, multimodal AI, personalization, large language models, AIGC, visual generation