How to Quantify a “Good Image” for AI‑Generated E‑Commerce Graphics?
This article explains how to define and objectively evaluate the quality of AI‑generated product images for e‑commerce by decoupling assessment from the generation pipeline, selecting quantifiable metrics such as CLIPScore and Inception Score, building a lightweight evaluation system, cleaning and labeling data, and validating the approach with real‑world business and model datasets.
1. When AI‑generated images become a must, how do we define a “good image”?
In e‑commerce, the quality of product main images and marketing posters directly impacts click‑through rates. The Youzan AIGC project provides a one‑stop mobile solution – upload product image → smart cut‑out → background generation – to lower creation barriers for non‑designers. However, traditional testing excels at workflow verification but struggles to assess the quality of generated content:
Is the image clear and professional?
Does the content precisely match the textual description?
Is the style suitable for the merchant’s scenario?
These conversion‑critical dimensions rely on subjective, time‑consuming human judgment and lack objective, quantifiable standards. To break this deadlock, we propose an independent third‑party quality assessment framework that does not interfere with the generation process and outputs a quantitative quality score for each image.
2. Metric selection and system architecture: How to implement image evaluation?
We first asked two questions: (1) How do we define a “good image”? (2) How can the system acquire this judgment capability? The answer is a set of quantifiable metrics that are open‑source, widely validated, and easy to integrate.
After reviewing common image‑text quality metrics (CLIPScore, IS, FID, BLIPScore, PickScore), we selected two core metrics for the first phase:
Inception Score (IS): measures image recognizability and diversity using an Inception‑v3 classifier trained on ImageNet.
CLIPScore: computes semantic similarity between the image and the prompt using a CLIP model.
These cover the two key dimensions – image quality and semantic alignment – and form a basic yet effective automated evaluation framework.
2.1 CLIPScore
CLIPScore evaluates the semantic consistency between the generated image and its prompt. It uses the OpenAI CLIP model to encode both image (CLIP_image) and text (CLIP_text) into vectors and calculates their cosine similarity. The score ranges theoretically from –1 to 1, but in practice concentrates in the 0‑0.6 interval (e.g., on the MUGE dataset).
We employ the Chinese‑CLIP model OFA‑Sys/Chinese‑CLIP, trained on ~200 million Chinese image‑text pairs, which gives it a much better grasp of Chinese prompts such as “a red bicycle on a grass field”. Example scores:
Exact match (red bike on grass): 0.55‑0.60
Bike correct but background missing: ~0.40
Wrong object (yellow truck): <0.20
Thus CLIPScore reflects how well the image fulfills the textual intent, independent of visual aesthetics.
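As a concrete reference, here is a minimal sketch of the CLIPScore computation using the Hugging Face transformers port of Chinese‑CLIP. The checkpoint name OFA-Sys/chinese-clip-vit-base-patch16 is an assumption for illustration; the production evaluate_clipscore.py may use a different checkpoint or loading path.

import torch
from PIL import Image
from transformers import ChineseCLIPModel, ChineseCLIPProcessor

# Assumed checkpoint; substitute the one actually used in evaluate_clipscore.py.
CKPT = "OFA-Sys/chinese-clip-vit-base-patch16"
model = ChineseCLIPModel.from_pretrained(CKPT).eval()
processor = ChineseCLIPProcessor.from_pretrained(CKPT)

@torch.no_grad()
def clip_score(image_path: str, prompt: str) -> float:
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    out = model(**inputs)
    # Cosine similarity of L2-normalized image and text embeddings.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum(dim=-1))

print(clip_score("generated.jpg", "草地上的一辆红色自行车"))  # "a red bicycle on a grass field"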
2.2 Inception Score (IS)
IS focuses on image recognizability and diversity. It feeds each image into an Inception‑v3 classifier to obtain a class probability distribution p(y|x), computes the KL‑divergence between each image’s distribution and the marginal distribution p(y) over the batch, and exponentiates the average. Higher IS indicates clear, varied images. The score has no fixed upper bound; typical values lie between 1 and 15.
IS is unsuitable for single‑image evaluation; it requires a batch of images (ideally >500) to produce stable statistics.
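For reference, here is a minimal sketch of the IS computation with torchvision’s ImageNet‑pretrained Inception‑v3; this is an illustrative implementation, not necessarily identical to evaluate_is.py.

import torch
import torch.nn.functional as F
from torchvision.models import inception_v3, Inception_V3_Weights

weights = Inception_V3_Weights.IMAGENET1K_V1
model = inception_v3(weights=weights).eval()
preprocess = weights.transforms()

@torch.no_grad()
def inception_score(pil_images) -> float:
    # pil_images: a batch of PIL images; batches of >500 give more stable statistics.
    batch = torch.stack([preprocess(img) for img in pil_images])
    probs = F.softmax(model(batch), dim=-1)                    # p(y|x) per image
    marginal = probs.mean(dim=0, keepdim=True)                 # p(y), marginal over the batch
    kl = (probs * (probs.log() - marginal.log())).sum(dim=-1)  # KL(p(y|x) || p(y))
    return float(kl.mean().exp())                              # IS = exp(mean KL)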
2.3 System architecture overview
NSQ consumer module: listens to generate_img_data_collect_topic and writes incoming data to a local CSV.
Batch API call module: reads the CSV, invokes Dubbo APIs to generate images, and records results and errors.
Image quality evaluation module: downloads the generated images and computes CLIPScore and IS, optionally combining them into a weighted total score.
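The weighted total score can be as simple as a linear combination after rescaling the two metrics to comparable ranges; below is a sketch under that assumption (the exact normalization used in main.py is not shown in this article).

def total_score(clip_score: float, is_score: float,
                clip_weight: float = 0.6, is_weight: float = 0.4) -> float:
    # Assumed rescaling to roughly [0, 1]: CLIPScore concentrates in 0-0.6
    # in practice, and IS typically falls between 1 and 15.
    clip_norm = min(max(clip_score / 0.6, 0.0), 1.0)
    is_norm = min(max((is_score - 1.0) / 14.0, 0.0), 1.0)
    return clip_weight * clip_norm + is_weight * is_norm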
Key scripts include message_listener.py, consumer_util.py, generate_request.py, dubbo_util.py, evaluate_clipscore.py, evaluate_is.py, and main.py. Example commands:
python message_listener.py # default output.csv
python message_listener.py --output output_sample.csv
python generate_request.py # default input data.csv
python generate_request.py --input data_sample.csv --error error_sample.csv
python evaluate_clipscore.py # default output result_clip.csv
python evaluate_is.py # default output result_is.csv
python main.py --clip_weight 0.6 --is_weight 0.4
3. Project validation: Does the system really work?
We used real‑world data captured by the NSQ listener as evaluation input and discovered three major issues:
Prompts were JSON payloads containing system messages, parameters, and control flow, making them unusable for CLIPScore.
Most user utterances were non‑descriptive commands (“not good, generate another one”), lacking visual semantics.
Only 44 images were available, all with similar subjects, rendering IS ineffective.
To address these, we performed data cleaning:
Removed system messages and control statements.
Filtered out command‑style language.
Extracted concise visual phrases (e.g., “green forest”, “lavender”).
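A simplified sketch of this cleaning pass is shown below; the field name user_prompt and the command keywords are assumptions for illustration only.

import json
import re
from typing import Optional

# Assumed keywords marking command-style utterances with no visual content.
COMMAND_PATTERNS = re.compile(r"(再生成|换一张|不好|重新|not good|generate another)")

def extract_visual_prompt(raw_message: str) -> Optional[str]:
    # Pull a concise visual phrase out of a raw listener message, or return None.
    try:
        payload = json.loads(raw_message)
        # Keep only the user-facing text; drop system messages, parameters and control flow.
        text = payload.get("user_prompt", "")
    except json.JSONDecodeError:
        text = raw_message
    text = text.strip()
    if not text or COMMAND_PATTERNS.search(text):
        return None
    return text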
After cleaning, CLIPScore improved from an average of 0.328 to 0.388, while IS remained low (~2.0) due to limited sample size.
We further added image‑subject labeling (e.g., adding “cake” to the prompt “taro lavender”), which raised CLIPScore to 0.449 and slightly lowered IS to 1.935, confirming that richer semantic prompts boost alignment scores.
3.1 Data‑driven test dataset
We built a structured dataset composed of product main images paired with controlled prompts. Prompt construction follows two principles: (1) stay close to real e‑commerce language, and (2) provide structured, diverse expressions. An example template guides LLMs to generate background‑only prompts for various product types.
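As an illustration of those principles, here is a hedged sketch of such a template; the wording is an assumption, not the exact template used in the project.

# Illustrative prompt template (assumed wording) asking an LLM to produce
# background-only descriptions for a given product category.
PROMPT_TEMPLATE = (
    "You are writing background descriptions for e-commerce product images.\n"
    "Product category: {category}\n"
    "Write {n} short Chinese phrases describing ONLY the background scene\n"
    "(no mention of the product itself), e.g. '绿色森林', '薰衣草花田'.\n"
    "Vary lighting, material and setting across the phrases."
)

print(PROMPT_TEMPLATE.format(category="cake", n=10))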
Product image selection was based on the top‑9 categories covering ~66 % of shop traffic. For each category we sampled 8‑20 representative images, achieving a balanced distribution.
3.2 Batch results
Two datasets were evaluated:
Business dataset (1000 items, real‑world distribution, cake ≈ 40 %). CLIPScore average ~0.35, IS ≈ 2.0.
Model dataset (1000 items, evenly distributed across categories). CLIPScore slightly higher, IS significantly higher due to greater visual diversity.
Analysis shows that category imbalance reduces both semantic alignment and diversity, while balanced samples improve IS.
3.3 Jira optimization cases
Case CSWT‑173889: Generated images sometimes placed the product off‑center or partially visible. The evaluation system flagged low CLIPScore (<0.3) and low classification confidence, triggering automatic re‑generation and reducing merchant “retry” rates.
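The triggering logic for that case reduces to a simple threshold rule; the sketch below uses the CLIPScore threshold from the case above, an assumed classifier‑confidence floor, and a hypothetical regenerate() hook.

def needs_regeneration(clip_score: float, top_class_prob: float) -> bool:
    # CLIPScore < 0.3 comes from the case above; 0.5 is an assumed
    # confidence floor for the image classifier.
    return clip_score < 0.3 or top_class_prob < 0.5

# Hypothetical usage inside the generation pipeline:
# if needs_regeneration(result.clip_score, result.top_class_prob):
#     regenerate(result.request)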
Prompt refinement: Ambiguous user prompts (“make it more upscale”) were rewritten to concrete descriptions (“marble background with gold accents”). A/B testing showed CLIPScore rising from 0.38 to 0.51, confirming the effectiveness of prompt polishing.
4. Future: Building a smarter, business‑aware evaluation engine
We plan three evolution tracks:
Technical evolution : expand metric library, modularize components, enable dynamic weighting based on business stage.
Business empowerment : maintain a live data pool covering major categories, create a “score‑issue‑improvement” feedback loop, and provide visual dashboards for operators, product managers, and merchants.
System expansion : expose the evaluation service via standardized APIs for seamless integration into e‑commerce, marketing, and design pipelines, and tailor metric sets per scenario (full‑stack vs lightweight).
By continuously refining this framework, we aim to turn vague aesthetic judgments into reliable data signals, driving AI‑generated content from merely “usable” to truly “practical, stable, and controllable”.
Youzan Coder
Official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.