Introducing CAIG: CTR‑Driven Advertising Image Generation with Open‑Source Code
CAIG leverages a multimodal large language model, a novel reward model, and product‑centered preference optimization to generate ad images that maximize click‑through rate, achieving state‑of‑the‑art performance in both online and offline evaluations.
Introduction
Existing advertising image generation methods optimize for visual appeal but often underperform on real-world click-through rate (CTR). CAIG (CTR-driven Advertising Image Generation) instead treats CTR as the primary objective, using a multimodal large language model (MLLM) to create ad images through a three-stage pipeline.
Method
Three‑stage workflow
Stage (a): Pre‑training – a large‑scale e‑commerce multimodal dataset is used to pre‑train the MLLM, injecting domain knowledge.
Stage (b): Reward model design – a reward model (RM) with dual branches (CTR regression head and image classification head) estimates CTR and identifies attractive images.
Stage (c): CTR-driven preference optimization – the prompt model (PM) generates background descriptions, which are fed to Stable Diffusion + ControlNet to produce candidate images; the RM predicts their CTRs and guides fine-tuning of the PM.
The pre‑training equips the MLLM with visual and textual understanding of product attributes. The RM is further fine‑tuned on massive multimodal online click data, learning to simulate human feedback. To mitigate absolute CTR variance across product categories, CTR regression is reformulated as a relative comparison task between paired images of the same product.
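A minimal sketch of this pairwise reformulation is shown below, assuming PyTorch and a shared scoring branch applied to the fused multimodal features of each image; the class and function names are illustrative, not CAIG's released code.

```python
import torch
import torch.nn as nn

class PairwiseRewardHead(nn.Module):
    """Sketch of the relative-comparison reformulation (names hypothetical).

    Instead of regressing absolute CTR, which varies widely across product
    categories, the head scores each image of a pair with shared weights and
    is trained on which one achieved the higher observed CTR.
    """

    def __init__(self, feat_dim: int = 768):
        super().__init__()
        self.scorer = nn.Sequential(  # shared CTR scoring branch
            nn.Linear(feat_dim, 256),
            nn.GELU(),
            nn.Linear(256, 1),
        )

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        # Score each image independently, then compare:
        # logit > 0 means image A is predicted to win the comparison.
        return self.scorer(feat_a) - self.scorer(feat_b)

def pairwise_loss(logit: torch.Tensor, a_wins: torch.Tensor) -> torch.Tensor:
    # a_wins = 1.0 when image A had the higher observed CTR.
    return nn.functional.binary_cross_entropy_with_logits(
        logit.squeeze(-1), a_wins
    )
```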
For each pair (I₁, I₂) with CTRs (c₁, c₂), a prompt engineering function f_instruct combines product attributes with a reward‑model‑specific question template Q_RM to produce a guiding prompt C_RM. Visual representations of the two images and the textual prompt are concatenated into a multimodal input for the RM.
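In code, f_instruct reduces to simple template filling. Only the symbols (f_instruct, Q_RM, C_RM) come from the paper; the template wording and attribute keys below are assumptions for illustration.

```python
def f_instruct(attributes: dict, question_template: str) -> str:
    """Hypothetical sketch of the prompt-engineering function f_instruct:
    combines product attributes with the reward-model question template
    Q_RM to produce the guiding prompt C_RM."""
    attr_text = ", ".join(f"{k}: {v}" for k, v in attributes.items())
    return question_template.format(attributes=attr_text)

# Example usage with an assumed template:
Q_RM = ("Product attributes: {attributes}. "
        "Which of the two advertising images is more likely to be clicked?")
C_RM = f_instruct({"category": "sneakers", "color": "white"}, Q_RM)
```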
Product‑centered Preference Optimization
CAIG frames higher‑CTR image generation as a preference selection problem, encouraging the generator to choose a more attractive positive image I⁺ and reject a less attractive negative image I⁻. The process consists of two steps:
Generate an image pair and compare their predicted CTRs using the RM.
Fine‑tune the generation model based on RM feedback. The PM supplies a background description y to Stable Diffusion, combined with the original product image I_o. ControlNet and inpainting ensure seamless fusion of product and generated background.
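The fusion step can be sketched with Hugging Face diffusers' ControlNet inpainting pipeline. The checkpoints, file names, and the simplified control image below are assumptions for illustration, not the paper's released configuration.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetInpaintPipeline
from diffusers.utils import load_image

# Checkpoint choices are assumptions; the paper does not pin specific models here.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_inpaint", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

product = load_image("product.png")        # original product image I_o
mask = load_image("background_mask.png")   # white where background is generated

# y: a background description produced by the prompt model
y = "a sunlit marble countertop with soft bokeh highlights"
ad_image = pipe(
    prompt=y,
    image=product,
    mask_image=mask,
    control_image=product,  # simplified; the inpaint ControlNet normally
                            # receives a masked-out conditioning image
    num_inference_steps=30,
).images[0]
```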
The overall algorithm flow is illustrated in the accompanying diagram.
Process Overview
Product image + instruction prompt → Prompt model generates two background descriptions → Background generation model creates two ad images → Reward model predicts CTR → Preference pair is determined.
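Put together, one round of this loop might look like the following sketch. Every interface here (prompt_model.sample, reward_model.compare, gen_pipe) is a hypothetical stand-in for the components described above, not an actual API.

```python
def caig_iteration(product_image, attributes, prompt_model, gen_pipe, reward_model):
    """One illustrative round of the CAIG loop (all interfaces hypothetical)."""
    # 1. Sample two distinct background descriptions.
    y1, y2 = prompt_model.sample(product_image, attributes, n=2)

    # 2. Render an ad image for each description (SD + ControlNet inpainting,
    #    mask handling omitted for brevity).
    img1 = gen_pipe(prompt=y1, image=product_image).images[0]
    img2 = gen_pipe(prompt=y2, image=product_image).images[0]

    # 3. The RM compares the pair's predicted CTRs.
    logit = reward_model.compare(img1, img2, attributes)

    # 4. The description behind the higher predicted CTR becomes the positive sample.
    pos, neg = (y1, y2) if float(logit) > 0 else (y2, y1)
    return pos, neg
```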
Key Steps
Background description generation stage: Prompt model produces two distinct descriptions.
Ad image generation stage: Stable Diffusion + ControlNet generate images.
Evaluation stage: Reward model estimates CTR.
Learning stage: The RM's feedback, learned from click data, determines the positive/negative samples and drives the preference update of the prompt model, as sketched below.
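A DPO-style objective over the positive/negative descriptions is one common way to realize this update; whether CAIG uses exactly this loss is an assumption of the sketch.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta: float = 0.1):
    """DPO-style preference loss (assumed, not confirmed as CAIG's exact objective).

    logp_* are the policy's summed log-probabilities of the positive/negative
    background description; ref_logp_* come from a frozen reference copy.
    """
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    return -F.logsigmoid(margin).mean()

# Example with dummy log-probabilities:
loss = dpo_loss(
    torch.tensor([-12.3]), torch.tensor([-15.8]),
    torch.tensor([-12.9]), torch.tensor([-15.1]),
)
```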
Experiments
Extensive offline and online experiments demonstrate that CAIG outperforms existing baselines, achieving state-of-the-art results on offline benchmarks and higher CTR in live online deployment.