CapOnImage: Context-driven Dense Captioning on Images

The paper presents CapOnImage (captioning on image), a new task that generates location-specific decorative text for product images. It introduces the CapOnImage2M dataset of 2.1 million product images and proposes a mixed-modality transformer with position-aware pre-training and progressive training. The model outperforms baseline captioning methods in accuracy and diversity, and it has been deployed in Alibaba's advertising platforms with reported business gains.

Alimama Tech

This paper introduces a new task called "captioning on image" (CapOnImage), which aims to generate decorative text captions at specific locations on product images to make advertising more effective.

Existing captioning systems generate narrative text that is not tied to specific image regions, which limits their use in advertising scenarios. To address this, the authors construct a large-scale dataset, CapOnImage2M, containing 2.1 million product images with titles, attributes, and location-specific captions.

The proposed model draws on multimodal context, including image content, product metadata, layout coordinates, and neighboring box information, and feeds it to a mixed-modality transformer that generates captions autoregressively. Several position-aware pre-training tasks (Level-I, Level-II, Level-III) and a progressive training strategy are designed to help the model learn spatial relationships.
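To make the idea of "layout coordinates plus neighboring box information" more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`normalize_box`, `location_features`), the choice of the `k` nearest boxes by center distance, and the zero-padding are all assumptions made for illustration. In a real model, a feature vector like this would presumably be projected into an embedding and passed to the transformer alongside the image and text tokens.

```python
from math import hypot


def normalize_box(box, width, height):
    """Scale an (x1, y1, x2, y2) pixel box into the [0, 1] range."""
    x1, y1, x2, y2 = box
    return (x1 / width, y1 / height, x2 / width, y2 / height)


def box_center(box):
    """Return the (x, y) center of a box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def location_features(boxes, width, height, k=2):
    """Build one location feature per text box (hypothetical helper).

    Each feature is the box's own normalized coordinates followed by the
    normalized coordinates of its k nearest neighbours (by center distance).
    Missing neighbours are zero-padded so every feature has 4 * (k + 1) values.
    """
    norm = [normalize_box(b, width, height) for b in boxes]
    feats = []
    for i, box in enumerate(norm):
        cx, cy = box_center(box)
        # Distance from this box's center to every other box's center.
        dists = []
        for j, other in enumerate(norm):
            if j == i:
                continue
            ox, oy = box_center(other)
            dists.append((hypot(ox - cx, oy - cy), j))
        dists.sort()
        neighbours = [norm[j] for _, j in dists[:k]]
        # Zero-pad when the image has fewer than k other boxes.
        neighbours += [(0.0, 0.0, 0.0, 0.0)] * (k - len(neighbours))
        feats.append(box + tuple(v for n in neighbours for v in n))
    return feats


# Example: three text boxes on a 400x400 product image.
example_boxes = [(0, 0, 100, 50), (0, 60, 100, 110), (300, 300, 400, 400)]
example_feats = location_features(example_boxes, 400, 400, k=1)
```

Including neighbor coordinates gives the model a way to tell nearby boxes apart, which fits the summary's point that neighboring box information is part of the context.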

Experiments show that the model outperforms baseline image captioning methods in both accuracy and diversity. Ablation studies confirm the effectiveness of each pre-training task and of the progressive training scheme. Visualizations show that the generated captions align well with the intended image regions.

The work has already been deployed in Alibaba’s advertising platforms (e.g., homepage focus slots and recommendation feeds), yielding significant business gains. The authors anticipate future end‑to‑end text rendering without separate layout prediction modules.


Deep Learning, Multimodal, Context-Aware, Dataset, Image Captioning