CTR-Driven Advertising Image Generation Using Multimodal Large Language Models
The paper presents CAIG, a CTR‑driven advertising image generation pipeline that pre‑trains a multimodal LLM on e‑commerce data, trains a reward model on CTR‑labeled image pairs, and fine‑tunes generation via product‑centric preference optimization, achieving state‑of‑the‑art online and offline performance.
This work, accepted at WWW 2025, investigates the generation of advertising images for e‑commerce platforms by optimizing click‑through rate (CTR) as the primary objective. The authors explore the use of multimodal large language models (MLLMs) and introduce a novel reward model combined with a product‑centric preference optimization strategy, achieving state‑of‑the‑art performance on both online and offline metrics.
Background and Motivation: Existing ad‑image generation methods focus on aesthetic quality rather than online performance, leading to a gap between generated images and actual user preferences. Inspired by recent RLHF approaches, the authors propose training a reward model (RM) and fine‑tuning the generation model via reinforcement learning (RL) to reflect user click preferences.
Overall Solution (CAIG): The proposed CTR‑driven Advertising Image Generation (CAIG) pipeline consists of (1) pre‑training a multimodal LLM on a large e‑commerce dataset (≈1.2M samples) to inject domain knowledge, (2) training a reward model on paired ad images with CTR labels, and (3) applying a product‑centric preference optimization (PCPO) stage that uses the RM to fine‑tune a prompt model (PM) and generate ad images with Stable Diffusion + ControlNet.
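The data flow through stage (3) at inference time might look like the following sketch. All names here are hypothetical stand-ins rather than the authors' actual interfaces, and the best-of-n candidate selection via the RM is an illustrative assumption, not a detail confirmed by the summary.

```python
from dataclasses import dataclass

@dataclass
class Product:
    image: str       # path or URL to the product image
    attributes: str  # textual product attributes

def render_with_sd_controlnet(product: Product, prompt: str) -> str:
    # Stub: a real implementation would call a Stable Diffusion +
    # ControlNet pipeline conditioned on the product foreground.
    return f"{product.image}+{prompt}"

def generate_ad_image(product: Product, prompt_model, reward_model,
                      n_candidates: int = 4) -> str:
    """Hypothetical inference loop: the fine-tuned prompt model proposes
    background prompts, candidates are rendered, and the CTR reward
    model picks the highest-scoring candidate (an assumed usage)."""
    prompts = [prompt_model(product) for _ in range(n_candidates)]
    candidates = [render_with_sd_controlnet(product, p) for p in prompts]
    return max(candidates, key=reward_model)
```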
E‑commerce Knowledge Pre‑training: Three pre‑training tasks are defined: image understanding, multimodal content understanding, and prompt generation. Together they teach the MLLM to understand product images and textual attributes and to generate or rewrite prompts for background creation.
Reward Model Based on MLLM: CTR prediction is reformulated as a relative comparison between image pairs. The RM takes multimodal inputs (visual + textual) and outputs both a binary preference and a point‑wise CTR estimate; the training loss combines binary cross‑entropy on the pairwise preference with a point‑level regression term.
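The combined RM objective described above can be sketched numerically as follows. The `alpha` weight and the squared-error form of the point-level term are assumptions for illustration; the summary does not give the paper's exact loss.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reward_model_loss(score_a, score_b, pref_label, ctr_a, ctr_b, alpha=1.0):
    """Pairwise BCE on the preference plus a point-wise CTR regression.

    score_a, score_b: scalar preference logits for images A and B.
    pref_label: 1 if image A won the CTR comparison, else 0.
    ctr_a, ctr_b: (predicted_ctr, observed_ctr) tuples per image.
    alpha: assumed weighting between the two terms.
    """
    # Pairwise term: P(A beats B) from the logit difference
    p_a_wins = sigmoid(score_a - score_b)
    bce = -(pref_label * np.log(p_a_wins)
            + (1 - pref_label) * np.log(1 - p_a_wins))
    # Point-wise term: squared error on each image's CTR prediction
    point = (ctr_a[0] - ctr_a[1]) ** 2 + (ctr_b[0] - ctr_b[1]) ** 2
    return bce + alpha * point
```

Ranking A above B when A actually won yields a lower loss than the reversed ranking, which is the behavior the pairwise term enforces.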
CTR‑Driven Optimization: The task is cast as a preference selection problem. Image pairs are generated, the RM ranks them, and the generation model is fine‑tuned using Direct Preference Optimization (DPO). To avoid over‑optimizing CTR at the expense of product‑background relevance, the authors introduce Product‑Centric Preference Optimization (PCPO), which treats product information as the sole variable and constructs additional preference pairs to enforce alignment.
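The DPO step can be illustrated with the standard DPO objective; variable names and the `beta` value are illustrative. Under PCPO, the product-centric preference pairs would simply supply additional (winner, loser) examples to the same loss.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO objective: raise the policy's log-probability of the
    RM-preferred ("winning") prompt relative to the rejected one, both
    measured against a frozen reference model.

    logp_w, logp_l: policy log-probs of winning / losing prompts.
    ref_logp_w, ref_logp_l: reference-model log-probs of the same prompts.
    beta: KL-tradeoff temperature (illustrative value).
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(sigmoid(margin))
```

A policy that assigns the winning prompt a higher relative likelihood than the losing one incurs a lower loss, so gradient descent pushes generation toward RM-preferred (higher predicted CTR) outputs.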
Experimental Results:
Reward Model Performance: The proposed method outperforms both closed‑source (GLM4V, Claude 3.5 Sonnet, GPT‑4o, GPT‑4V) and open‑source (VAM, CG4CTR) baselines on commercial and public datasets.
Product‑Background Relevance: PCPO maintains higher matching rates across training epochs compared to standard DPO, demonstrating better preservation of product relevance.
Online Experiments: A week‑long live test on JD.com covering 44 product categories shows significant CTR lifts over baseline MLLM generation, confirming the practical impact of the CAIG approach.
Paper: https://arxiv.org/pdf/2502.06823
Code: https://github.com/Chenguoz/CAIG
JD Retail Technology
Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.