CapOnImage: Context-driven Dense Captioning on Images

The paper presents CapOnImage, a new on-image captioning task that generates location-specific decorative text for product images. It introduces the CapOnImage2M dataset of 2.1 million product images and proposes a mixed-modality transformer trained with position-aware pre-training tasks and a progressive training strategy. The model outperforms baseline captioning methods in accuracy and diversity and is already deployed in Alibaba's advertising platforms, where it has delivered measurable business gains.

Alimama Tech

This paper introduces a new task, on-image caption generation (CapOnImage), which generates decorative text captions at specific locations on product images to make advertisements more effective.

Existing captioning systems produce text that is not tied to specific image regions, which limits their use in advertising scenarios. To address this, the authors construct a large-scale dataset, CapOnImage2M, containing 2.1 million product images with titles, attributes, and location-specific captions.

The proposed model draws on multimodal context (image content, product metadata, layout coordinates, and neighboring box information) through a mixed-modality transformer that generates captions autoregressively. Three position-aware pre-training tasks (Level-I, Level-II, Level-III) and a progressive training strategy are designed to help the model understand spatial relationships.
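To make the layout part of that context concrete, here is a minimal sketch of how a target text box and its neighboring boxes might be turned into numeric features before being embedded. It is not the authors' implementation: the `Box` class, the six-number feature layout, the center-distance rule for choosing neighbors, and the names `normalize`, `nearest_neighbors`, and `build_layout_context` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Box:
    """Hypothetical text box in pixels: (x1, y1) top-left, (x2, y2) bottom-right."""
    x1: float
    y1: float
    x2: float
    y2: float

    def center(self) -> Tuple[float, float]:
        return ((self.x1 + self.x2) / 2.0, (self.y1 + self.y2) / 2.0)


def normalize(box: Box, width: float, height: float) -> List[float]:
    """Scale corners to [0, 1] and append the relative width and height."""
    return [
        box.x1 / width, box.y1 / height,
        box.x2 / width, box.y2 / height,
        (box.x2 - box.x1) / width, (box.y2 - box.y1) / height,
    ]


def nearest_neighbors(target: Box, others: List[Box], k: int = 2) -> List[Box]:
    """Return the k boxes whose centers lie closest to the target's center."""
    tx, ty = target.center()

    def squared_distance(b: Box) -> float:
        cx, cy = b.center()
        return (cx - tx) ** 2 + (cy - ty) ** 2

    return sorted(others, key=squared_distance)[:k]


def build_layout_context(boxes: List[Box], index: int,
                         width: float, height: float,
                         k: int = 2) -> Dict[str, List]:
    """Collect normalized features for one target box and its k nearest boxes."""
    target = boxes[index]
    others = boxes[:index] + boxes[index + 1:]
    neighbors = nearest_neighbors(target, others, k)
    return {
        "target": normalize(target, width, height),
        "neighbors": [normalize(b, width, height) for b in neighbors],
    }


if __name__ == "__main__":
    # Three illustrative boxes on a 400x400 product image.
    layout = [Box(0, 0, 100, 50), Box(0, 60, 100, 110), Box(300, 300, 400, 400)]
    print(build_layout_context(layout, index=0, width=400, height=400, k=1))
```

In a full model, vectors like these would typically be projected into the transformer's embedding space and combined with the image and text tokens; that step is omitted here because the summary does not describe it.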

Experiments show that the model outperforms baseline image captioning methods in both accuracy and diversity. Ablation studies confirm that each pre-training task and the progressive training scheme contribute to the result. Visualizations show that the generated captions align well with their intended image regions.
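This summary does not say which metrics were used to measure diversity. As one illustration of how caption diversity is commonly quantified, the sketch below computes distinct-n, the ratio of unique n-grams to total n-grams across a set of generated captions. The choice of distinct-n and the function name `distinct_n` are assumptions, not details taken from the paper.

```python
from typing import Sequence, Set, Tuple


def distinct_n(captions: Sequence[Sequence[str]], n: int = 2) -> float:
    """Ratio of unique n-grams to total n-grams over all tokenized captions.

    Values near 1.0 mean few repeated phrases; values near 0.0 mean the
    captions reuse the same phrases heavily. Returns 0.0 if no n-grams exist.
    """
    unique: Set[Tuple[str, ...]] = set()
    total = 0
    for tokens in captions:
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


if __name__ == "__main__":
    # Two illustrative captions generated for different boxes on one image.
    generated = [
        ["soft", "cotton", "fabric"],
        ["soft", "cotton", "lining"],
    ]
    print(distinct_n(generated, n=2))  # 3 unique bigrams out of 4 -> 0.75
```

A metric like this is relevant to dense captioning because captions for nearby boxes on the same image can easily repeat one another, which is a failure that accuracy scores alone do not expose.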

The work has already been deployed in Alibaba’s advertising platforms (e.g., homepage focus slots and recommendation feeds), yielding significant business gains. The authors anticipate future end‑to‑end text rendering without separate layout prediction modules.

Tags: advertising, multimodal, deep learning, context-aware, dataset, image captioning