Can Multimodal LLMs Boost Ad Click‑Through Rates? Introducing CTR‑Driven Image Generation

This paper presents a CTR‑driven advertising image generation framework that leverages multimodal large language models, reward modeling, and reinforcement learning to produce product‑centric ad visuals with higher click‑through performance, validated by extensive offline and online experiments.

JD Cloud Developers

Mar 13, 2025

Abstract

Advertising images are crucial for e‑commerce platforms, yet most existing methods focus on aesthetic quality rather than online performance. We explore multimodal large language models (MLLMs) for ad image generation, optimizing click‑through rate (CTR) as the primary objective. After pre‑training MLLMs on a large e‑commerce multimodal dataset, we fine‑tune them through reinforcement learning (RL), guided by a novel reward model that jointly leverages multimodal features to reflect user click preferences. A product‑centric preference optimization strategy ensures generated backgrounds align with product attributes, improving overall relevance and effectiveness. Experiments demonstrate state‑of‑the‑art performance on both online and offline metrics.

1. Background and Current Situation

Advances in image generation enable realistic product backgrounds, but most methods prioritize offline metrics such as visual quality or semantic consistency, neglecting the link between visual content and CTR. This gap leads to a mismatch between generated ads and actual user preferences.

Inspired by recent RLHF approaches, we train a reward model (RM) and use RL to fine‑tune the generator, with the RM providing the rewards that guide optimization. The RM has to judge click preferences from both images and text, but existing CTR prediction methods lack strong visual understanding and struggle to fuse multimodal features.

Moreover, aligning the background with the product is essential. Prior RL algorithms focus solely on reward maximization and ignore the balance between visual appeal and product relevance, which can produce backgrounds that do not match the product's category.

2. Overall Solution

We propose Click‑through‑Rate‑Driven Advertising Image Generation (CAIG). First, we pre‑train a multimodal LLM on a large e‑commerce dataset to inject domain knowledge, forming the basis for a Prompt Model (PM) and Reward Model (RM). The RM is initialized from the pre‑trained MLLM and further trained on massive multimodal click data to simulate human feedback. Finally, a product‑centric preference optimization (PCPO) stage uses RM feedback to fine‑tune the PM, producing attractive and product‑relevant ad images.

3. E‑commerce Knowledge Pre‑training

To support ad creative generation at scale, we pre‑train on a multimodal e‑commerce dataset of 1.2 million samples from JD.com. The pre‑training tasks include:

Image understanding: describe product or background from the product image.

Multimodal content understanding: generate background or product title from multimodal information (title, category, tags).

Prompt generation: create or rewrite prompts based on multimodal product data.

4. Reward Model Based on MLLM

We reformulate CTR prediction as a relative comparison between image pairs. Each pair contains two ad images of the same product with their CTRs. Product attributes are combined with a task‑specific template Q_RM, and a prompt C_RM is generated via a prompt‑engineering function f_instruct. Visual and textual representations are concatenated as multimodal input.
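The pairwise setup described above can be sketched as follows. This is only an illustration under stated assumptions: the wording of the template standing in for Q_RM, the `AdImage` record, the `build_pairs` helper, and the choice to skip ties are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class AdImage:
    # Hypothetical record: one ad image of a product and its observed CTR.
    image_id: str
    product_id: str
    ctr: float


# Hypothetical stand-in for the task template Q_RM; the real wording is not given.
Q_RM = (
    "Product title: {title}. Category: {category}. "
    "Which of the two ad images is more likely to be clicked?"
)


def f_instruct(title: str, category: str) -> str:
    """Fill the task template with product attributes to obtain the prompt C_RM."""
    return Q_RM.format(title=title, category=category)


def build_pairs(images, min_gap=0.0):
    """Pair up images of the same product and label each pair by CTR.

    Returns (first, second, label) triples, where label is 1 if the first
    image has the higher CTR and 0 otherwise. Pairs whose CTR gap is at most
    min_gap (including exact ties) are skipped; this filter is an assumption.
    """
    by_product = {}
    for img in images:
        by_product.setdefault(img.product_id, []).append(img)

    pairs = []
    for group in by_product.values():
        for first, second in combinations(group, 2):
            if abs(first.ctr - second.ctr) <= min_gap:
                continue
            pairs.append((first, second, 1 if first.ctr > second.ctr else 0))
    return pairs
```

Only images that share a product are compared, so the label reflects the effect of the image rather than differences between products.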

The MLLM processes the multimodal input to produce hidden states H; the hidden state of the last token serves as a discriminative representation. A classification head maps this representation to a binary probability distribution p over which image in the pair has the higher CTR. In addition, a CTR regression branch contributes a point‑level loss that refines the predictions. The final RM loss combines the binary cross‑entropy loss with the point‑level loss.
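A minimal sketch of how such a combined loss could be computed for one image pair is shown below. The article only states that binary cross‑entropy and a point‑level loss are combined; the mean‑squared‑error form of the point‑level term, the weight `alpha`, and all function names are assumptions made for illustration.

```python
import math


def softmax2(z0, z1):
    """Numerically stable softmax over two logits."""
    m = max(z0, z1)
    e0, e1 = math.exp(z0 - m), math.exp(z1 - m)
    total = e0 + e1
    return e0 / total, e1 / total


def rm_loss(logits, label, pred_ctrs, true_ctrs, alpha=1.0):
    """Pairwise cross-entropy plus a point-level CTR regression term.

    logits    : two classification-head outputs, index 0 = "first image wins",
                index 1 = "second image wins"
    label     : index (0 or 1) of the image with the higher observed CTR
    pred_ctrs : CTRs predicted by the regression branch for both images
    true_ctrs : observed CTRs for both images
    alpha     : hypothetical weight on the point-level term (assumption)
    """
    p = softmax2(*logits)
    pair_loss = -math.log(p[label])  # cross-entropy on the pairwise label
    # Assumed point-level loss: mean squared error on the two CTR predictions.
    point_loss = sum((pr - tr) ** 2 for pr, tr in zip(pred_ctrs, true_ctrs)) / len(true_ctrs)
    return pair_loss + alpha * point_loss
```

With uninformative logits and perfect CTR predictions the loss reduces to log 2, the cross‑entropy of a coin flip, which makes a convenient sanity check.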

5. CTR‑Driven Optimization

The task is cast as a preference selection problem: generate image pairs, let the RM compare their CTRs, and fine‑tune the generator based on RM feedback. In the pipeline, the PM produces background descriptions, which are fed to Stable Diffusion with ControlNet inpainting to create product‑centric backgrounds. Because collecting real CTR feedback is costly, the RM provides real‑time preference signals instead. We adopt Direct Preference Optimization (DPO) as the base strategy:
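The formula itself is not reproduced in this text. As a sketch, the standard DPO objective, written with this section's conditioning on (I_o, C), would read as follows. The symbols y_w and y_l (the outputs the RM prefers and rejects), the frozen reference policy \pi_{\mathrm{ref}}, and the temperature \beta come from the standard DPO formulation and are assumptions here, not notation confirmed by the source.

```latex
\mathcal{L}_{\mathrm{DPO}}
  = -\,\mathbb{E}_{(I_o,\, C,\, y_w,\, y_l)}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid I_o, C)}{\pi_{\mathrm{ref}}(y_w \mid I_o, C)}
        - \beta \log \frac{\pi_\theta(y_l \mid I_o, C)}{\pi_{\mathrm{ref}}(y_l \mid I_o, C)}
      \right)
    \right]
```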

Here I_o and C denote the original product image and its instruction.

To avoid over‑optimizing CTR at the expense of product‑background harmony, we introduce Product‑Centric Preference Optimization (PCPO). PCPO treats product information as the sole variable, constructing additional preference pairs (I_o, y⁺, y⁻) where y⁺ aligns better with product features than y⁻. The PCPO objective encourages the model to generate backgrounds that are both attractive and product‑consistent.

The DPO and PCPO losses are jointly optimized.
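One way to picture the joint optimization is the sketch below. The article does not give the PCPO formula, so this assumes PCPO uses the same log‑sigmoid preference form as standard DPO, applied to the product‑centric pairs (y⁺, y⁻), and that the two terms are combined as a weighted sum. The weight `lam`, the default `beta`, and the function names are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def preference_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """Standard DPO-style loss for one preference pair.

    logp_*     : log-probabilities of the preferred / rejected output under
                 the model being fine-tuned
    ref_logp_* : the same quantities under the frozen reference model
    """
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    return -log_sigmoid(margin)


def joint_loss(ctr_pair, product_pair, beta=0.1, lam=1.0):
    """Weighted sum of a CTR-preference term and a product-centric term.

    ctr_pair     : (logp_pos, logp_neg, ref_logp_pos, ref_logp_neg) for a pair
                   ranked by the reward model
    product_pair : the same four values for a pair ranked by product alignment
    lam          : hypothetical weight on the product-centric term (assumption)
    """
    return preference_loss(*ctr_pair, beta=beta) + lam * preference_loss(*product_pair, beta=beta)
```

When the fine‑tuned model and the reference model agree on every output, each term equals log 2; a term falls below log 2 once the model prefers the positive output more strongly than the reference does.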

6. Experimental Results

Reward Model Performance

We evaluated on commercial and public datasets, comparing against various open‑source and closed‑source MLLM‑based models. Closed‑source models (GLM4V, Claude 3.5 Sonnet, GPT‑4o, GPT‑4V) achieved near‑random pairwise accuracy (~50%). Open‑source models (VAM, CG4CTR) improved modestly but remained limited by weak visual representations. Our method achieved state‑of‑the‑art performance across all datasets.

Product‑Background Relevance

Using the same RM and the same number of training epochs, we compared PCPO with standard DPO. After 5 epochs, DPO’s matching rate dropped from 0.842 to 0.597, whereas PCPO maintained 0.798, a 33.7% relative improvement over DPO at the same stage.
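The 33.7% figure is a relative gain, not a difference in percentage points. A short check using only the two matching rates reported above (the variable names are illustrative):

```python
dpo_rate = 0.597   # DPO matching rate after 5 epochs (from the text)
pcpo_rate = 0.798  # PCPO matching rate after 5 epochs (from the text)

# Relative improvement of PCPO over DPO at the same stage.
relative_gain = (pcpo_rate - dpo_rate) / dpo_rate
print(f"{relative_gain:.1%}")  # prints 33.7%
```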

Qualitative analysis further shows PCPO generates backgrounds better aligned with product semantics.

Online Experiments

We conducted a week‑long online test on recommendation ads, generating two images for 44 product categories (covering almost all common items). Compared with the baseline pre‑trained MLLM, our CAIG method consistently improved CTR, both across all categories and within the top five categories. Using PCPO instead of DPO further boosted CTR by focusing on product features.
