Introducing CAIG: CTR‑Driven Advertising Image Generation with Open‑Source Code

CAIG leverages a multimodal large language model, a novel reward model, and product‑centered preference optimization to generate ad images that maximize click‑through rate, achieving state‑of‑the‑art performance in both online and offline evaluations.


Introduction

Existing advertising image generation methods focus on visual appeal but often underperform on real‑world click‑through rate (CTR). CAIG (CTR‑driven Advertising Image Generation) instead treats CTR as the primary optimization objective for a multimodal large language model (MLLM), organizing generation into three stages.

Method

Three‑stage workflow

Stage (a): Pre‑training – a large‑scale e‑commerce multimodal dataset is used to pre‑train the MLLM, injecting domain knowledge.

Stage (b): Reward model design – a reward model (RM) with dual branches (CTR regression head and image classification head) estimates CTR and identifies attractive images.

Stage (c): CTR‑driven preference optimization – the prompt model (PM) creates background descriptions, which are fed to Stable Diffusion + ControlNet to generate images; the RM predicts CTR, and its feedback guides fine‑tuning of the PM.
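The dual‑branch RM of stage (b) can be pictured as a shared encoder feeding two heads. The sketch below is a minimal NumPy stand‑in, not the paper's code; the layer sizes and layout are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class DualBranchRewardModel:
    """Sketch of a reward model with a shared encoder and two branches:
    a CTR regression head and an image (attractiveness) classification
    head. Dimensions and weights are illustrative."""

    def __init__(self, dim: int = 16, num_classes: int = 2):
        self.w_enc = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.w_ctr = rng.standard_normal((dim, 1)) / np.sqrt(dim)
        self.w_cls = rng.standard_normal((dim, num_classes)) / np.sqrt(dim)

    def __call__(self, fused_features: np.ndarray):
        h = np.tanh(fused_features @ self.w_enc)  # shared multimodal features
        ctr = (h @ self.w_ctr).squeeze(-1)        # CTR regression branch
        logits = h @ self.w_cls                   # classification branch
        return ctr, logits

rm = DualBranchRewardModel()
ctr, logits = rm(rng.standard_normal((4, 16)))   # batch of 4 fused inputs
```

In CAIG the encoder would be the pre‑trained MLLM; here it is reduced to one dense layer purely to show the two‑head data flow.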

The pre‑training equips the MLLM with visual and textual understanding of product attributes. The RM is further fine‑tuned on massive multimodal online click data, learning to simulate human feedback. To mitigate absolute CTR variance across product categories, CTR regression is reformulated as a relative comparison task between paired images of the same product.
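The relative‑comparison reformulation can be made concrete with a tiny helper. This is an illustrative label encoding, not the paper's exact one: the target only records which image of a same‑product pair won, so wildly different absolute CTRs across categories no longer matter.

```python
def pairwise_target(c1: float, c2: float) -> int:
    """Relative-comparison target: 1 if the first image of a
    same-product pair has the higher CTR, else 0 (illustrative
    encoding, assumed rather than taken from the paper)."""
    return int(c1 > c2)

# A high-traffic and a low-traffic product: absolute CTRs differ by
# an order of magnitude, but each pair yields the same kind of label.
label_a = pairwise_target(0.031, 0.024)    # 1: first image won
label_b = pairwise_target(0.0008, 0.0012)  # 0: second image won
```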

For each pair (I₁, I₂) with CTRs (c₁, c₂), a prompt engineering function f_instruct combines product attributes with a reward‑model‑specific question template Q_RM to produce a guiding prompt C_RM. Visual representations of the two images and the textual prompt are concatenated into a multimodal input for the RM.
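A sketch of the prompt‑engineering step, assuming a simple template‑filling form for f_instruct; the question text and attribute format here are invented for illustration, not quoted from the paper.

```python
def f_instruct(attributes: dict, question_template: str) -> str:
    """Hypothetical sketch of f_instruct: combine product attributes
    with the reward-model question template Q_RM to produce the
    guiding prompt C_RM."""
    attr_text = ", ".join(f"{k}: {v}" for k, v in attributes.items())
    return question_template.format(attributes=attr_text)

# Q_RM wording is an assumption for illustration only.
Q_RM = ("Given a product with {attributes}, which of the two candidate "
        "ad images is more likely to be clicked?")
C_RM = f_instruct({"category": "sneakers", "brand": "ACME"}, Q_RM)
```

The resulting C_RM is then concatenated with the visual representations of the two images to form the RM's multimodal input.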

Product‑centered Preference Optimization

CAIG frames higher‑CTR image generation as a preference selection problem, encouraging the generator to choose a more attractive positive image I⁺ and reject a less attractive negative image I⁻. The process consists of two steps:

Generate an image pair and compare their predicted CTRs using the RM.

Fine‑tune the generation model based on RM feedback. The PM supplies a background description y to Stable Diffusion, combined with the original product image I_o. ControlNet and inpainting ensure seamless fusion of product and generated background.
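One plausible instantiation of "prefer I⁺ over I⁻" is a DPO‑style preference objective; CAIG's exact loss may differ, so treat this as a hedged sketch of the idea rather than the paper's formula.

```python
import math

def preference_loss(logp_pos: float, logp_neg: float,
                    ref_logp_pos: float, ref_logp_neg: float,
                    beta: float = 0.1) -> float:
    """DPO-style loss: -log sigmoid of the scaled margin between the
    policy's preference for I+ over I- and the reference model's.
    Smaller when the policy assigns relatively more probability to
    the higher-CTR (positive) sample."""
    margin = beta * ((logp_pos - ref_logp_pos)
                     - (logp_neg - ref_logp_neg))
    return math.log1p(math.exp(-margin))  # -log sigmoid(margin), stable form

# The loss drops as the policy shifts mass toward the positive sample.
loss_good = preference_loss(-1.0, -3.0, -2.0, -2.0)  # policy prefers I+
loss_bad = preference_loss(-3.0, -1.0, -2.0, -2.0)   # policy prefers I-
```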

The overall algorithm flow is illustrated in the accompanying diagram.

Simple Process

Product image + instruction prompt → Prompt model generates two background descriptions → Background generation model creates two ad images → Reward model predicts CTR → Preference pair is determined.
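The pipeline above can be sketched as one function. The model callables (`pm`, `bg_gen`, `rm`) and their signatures are illustrative stand‑ins for the prompt model, the Stable Diffusion + ControlNet generator, and the reward model.

```python
def caig_iteration(product_image, instruction, pm, bg_gen, rm):
    """One iteration of the CAIG loop: generate two candidates, score
    them with the reward model, and return the (positive, negative)
    preference pair. Signatures are assumed, not from the paper."""
    desc_a, desc_b = pm(product_image, instruction)  # two background descriptions
    img_a = bg_gen(product_image, desc_a)            # two candidate ad images
    img_b = bg_gen(product_image, desc_b)
    if rm(img_a) >= rm(img_b):                       # RM predicts relative CTR
        return (desc_a, img_a), (desc_b, img_b)
    return (desc_b, img_b), (desc_a, img_a)

# Toy stubs purely to show the data flow end to end.
pm = lambda img, ins: ("beach scene", "studio backdrop")
bg_gen = lambda img, desc: f"{img}+{desc}"
rm = lambda img: len(img)  # dummy CTR proxy
pos, neg = caig_iteration("shoe", "make it attractive", pm, bg_gen, rm)
```

The returned pair then feeds the preference‑optimization step that updates the prompt model.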

Key Steps

Background description generation stage: Prompt model produces two distinct descriptions.

Ad image generation stage: Stable Diffusion + ControlNet generate images.

Evaluation stage: Reward model estimates CTR.

Learning stage: Click data determines positive/negative samples and updates the model.

Experiments

Extensive offline and online experiments demonstrate that CAIG outperforms existing baselines, achieving state‑of‑the‑art metrics on both offline benchmarks and live CTR performance.

Tags: CTR, open-source, reinforcement learning, multimodal LLM, ad image generation, product preference optimization
Written by AIWalker

Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.
