CTR-Driven Advertising Image Generation with Multimodal Large Language Models

This paper proposes CAIG, a novel method for generating high-CTR advertising images using multimodal large language models, combining reinforcement learning and preference optimization to align generated content with product features.

JD Tech Talk

Mar 13, 2025

This paper addresses the challenge of generating advertising images that are not only aesthetically strong but also achieve high click-through rates (CTR) on e-commerce platforms. The proposed method, CAIG (CTR-Driven Advertising Image Generation), uses multimodal large language models (MLLMs) to create ad visuals that attract user attention and improve advertising effectiveness.

The approach consists of several key components. First, the authors build a pre-training pipeline using a large-scale multimodal e-commerce dataset containing 1.2 million samples from JD.com. This pre-training process involves three main tasks: image understanding (describing products or backgrounds from product images), multimodal content understanding (describing product backgrounds or generating titles from multimodal product information), and prompt generation (generating or rewriting prompts based on multimodal product information).

Second, the paper introduces a novel MLLM-based reward model (RM) that predicts CTR by comparing pairs of advertising images. The RM is trained on large-scale multimodal online user click data and combines a binary cross-entropy loss with a point-level loss to predict the relative CTR of an image pair. This addresses a limitation of existing CTR prediction methods, which have limited visual understanding and struggle to integrate multimodal features.
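The combined objective described above can be illustrated with a minimal sketch. It assumes the RM emits one scalar score per image, that the pairwise term is a binary cross-entropy on the sigmoid of the score difference, and that the point-level term is a squared error between a sigmoid-mapped score and the observed CTR. The summary does not specify the exact form of the point-level loss or how the two terms are weighted, so the names `combined_loss` and `point_weight` and the squared-error choice are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def pairwise_bce(score_a: float, score_b: float, a_wins: float) -> float:
    """Binary cross-entropy on P(image A has the higher CTR)."""
    eps = 1e-12
    p = min(max(sigmoid(score_a - score_b), eps), 1.0 - eps)
    return -(a_wins * math.log(p) + (1.0 - a_wins) * math.log(1.0 - p))


def point_loss(pred_ctr: float, true_ctr: float) -> float:
    """Assumed point-level term: squared error on a single image's CTR."""
    return (pred_ctr - true_ctr) ** 2


def combined_loss(score_a: float, score_b: float,
                  ctr_a: float, ctr_b: float,
                  point_weight: float = 0.5) -> float:
    """Pairwise BCE plus a weighted point-level term (weight is hypothetical)."""
    a_wins = 1.0 if ctr_a > ctr_b else 0.0
    pair = pairwise_bce(score_a, score_b, a_wins)
    point = (point_loss(sigmoid(score_a), ctr_a)
             + point_loss(sigmoid(score_b), ctr_b))
    return pair + point_weight * point


if __name__ == "__main__":
    # Image A was clicked more often (5% vs 2%); scores ordered the same way
    # should incur a lower loss than scores ordered the wrong way.
    print(combined_loss(-3.0, -4.0, ctr_a=0.05, ctr_b=0.02))
    print(combined_loss(-4.0, -3.0, ctr_a=0.05, ctr_b=0.02))
```

In this sketch the pairwise term teaches the model which image of a pair wins, while the point-level term keeps each individual score tied to an observed CTR.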

Third, the authors propose a CTR-driven preference optimization strategy called PCPO (Product-Centric Preference Optimization). It keeps the generated background content consistent with product features while optimizing for CTR. PCPO builds on Direct Preference Optimization (DPO) as its base strategy and adds a product-centric component that maintains product-background relevance.
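For reference, the standard DPO objective that serves as the base strategy can be sketched as follows. This covers only plain DPO; the product-centric term that PCPO adds is not specified in this summary and is deliberately left out. The inputs are assumed to be sequence log-probabilities of a preferred and a rejected output under the trained policy and a frozen reference model, and the assumption that "preferred" means "judged higher-CTR by the reward model" is an interpretation, not a detail confirmed here. The function names and the default `beta` are illustrative.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for a single preference pair.

    loss = -log sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l)))
    where w is the chosen output and l is the rejected one.
    """
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) equals softplus(-z).
    return softplus(-logits)


if __name__ == "__main__":
    # Policy identical to the reference: loss equals log(2).
    print(dpo_loss(-11.0, -11.0, -11.0, -11.0))
    # Policy has shifted probability toward the chosen output: loss drops.
    print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

The loss falls as the policy raises the likelihood of the chosen output relative to the reference model and lowers that of the rejected one; `beta` controls how strongly deviations from the reference are rewarded.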

The experimental results show that the proposed method achieves state-of-the-art performance on both commercial and public datasets. In CTR prediction tasks, the reward model outperforms closed-source models (GLM-4V, Claude 3.5 Sonnet, GPT-4o, GPT-4V) as well as the open-source methods VAM and CG4CTR. The PCPO strategy improves significantly on standard DPO, maintaining higher product-background matching rates during training.

Online A/B testing results show that CAIG achieves substantial CTR improvements across 44 product categories, far exceeding the coverage of previous methods. The method demonstrates that more accurate CTR prediction can drive the generation model to produce images with higher CTR while maintaining product relevance.

The paper addresses a critical gap in existing advertising image generation methods by focusing on CTR optimization rather than just aesthetic quality, providing a practical solution for e-commerce platforms to improve advertising effectiveness through AI-generated content.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.

Written by

JD Tech Talk

Official JD Tech public account delivering best practices and technology innovation.
