How Multimodal AI Transforms Advertising Copy: From Image Text to Video Scripts

Alibaba's advertising AI team presents a comprehensive study of four new multimodal copywriting tasks (image overlay text generation, unsupervised text style transfer, detail-page copy extraction, and long-form video narration), detailing model architectures, training at billion-image scale, experimental results, and practical deployment in the "Xiyu" product.


1. Image Overlay Copy Generation

Background: Advertising images often need decorative text boxes (product name, selling points, call-to-action) to increase conversion. Manual creation is costly, and template-based methods cannot adapt to diverse layouts or to the spatial relationships between text and visual elements.

Method

A context-aware multimodal model [1] receives the following inputs for each target box: the raw image, the current box coordinates, the coordinates of neighboring boxes, the product category, the product title, and attribute key-value pairs. Each input is embedded (visual features from a CNN, positional embeddings for discretized box locations, and token embeddings for textual data) and fed into a multi-layer transformer, which generates the text for the target box autoregressively. Box coordinates are discretized onto a fixed grid (e.g., a 5×5 patch grid) and encoded as patch IDs. During training, text regions on the image are masked so the model cannot simply perform OCR. The model is trained on a dataset of roughly one billion product images.
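
As a concrete illustration of the position encoding, the sketch below shows one way to discretize normalized box coordinates onto a fixed grid and turn the resulting patch IDs into embeddings. This is a minimal sketch under assumptions: the grid size, embedding dimension, corner-based encoding, and all names are illustrative, not the team's released code.

```python
import torch
import torch.nn as nn

GRID = 5          # assumed 5x5 grid, as in the example above
EMB_DIM = 256     # illustrative embedding size

def box_to_patch_ids(box, grid=GRID):
    """Discretize a normalized box (x0, y0, x1, y1) with coordinates in
    [0, 1] into patch IDs for its two corners on a grid x grid lattice."""
    x0, y0, x1, y1 = box
    def patch_id(x, y):
        col = min(int(x * grid), grid - 1)
        row = min(int(y * grid), grid - 1)
        return row * grid + col
    return patch_id(x0, y0), patch_id(x1, y1)

class BoxPositionEmbedding(nn.Module):
    """Embeds the discretized corner patch IDs of a box and sums them,
    yielding one positional vector per box."""
    def __init__(self, grid=GRID, dim=EMB_DIM):
        super().__init__()
        self.table = nn.Embedding(grid * grid, dim)

    def forward(self, boxes):
        # boxes: (batch, 4) normalized coordinates
        ids = torch.tensor([box_to_patch_ids(b.tolist()) for b in boxes],
                           device=boxes.device)      # (batch, 2)
        return self.table(ids).sum(dim=1)            # (batch, dim)

# Usage: the resulting vector would be concatenated (or added) to the CNN
# image features and token embeddings before the transformer layers.
boxes = torch.tensor([[0.1, 0.2, 0.45, 0.3]])
pos = BoxPositionEmbedding()(boxes)                  # shape (1, 256)
```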

Results

After training, the model produces fluent, semantically appropriate copy that meets deployment quality thresholds. Sample outputs on previously blank images show that the generated text aligns with upstream layout predictions and respects spatial constraints.

2. Unsupervised Text Style Transfer

Background: Advertising copy must adapt to platform-specific styles (tone, sentiment, gendered language). Existing unsupervised style-transfer frameworks often suffer from content loss or insufficient style fidelity.

Method

The baseline architecture consists of an encoder, a discriminator, and two decoders (source-style and target-style). The encoder abstracts semantic content while stripping style attributes; the discriminator is trained with a negative-log-likelihood loss to distinguish latent representations from the two style domains. Improvements over this baseline include the following (a minimal code sketch appears after the list):

Multiple corruption strategies (random deletion, span masking) for robust reconstruction.

Pseudo‑parallel data generation via latent‑vector retrieval to augment training.

Additional encoder tasks (e.g., contrastive learning) and discriminator optimization for richer feature spaces.

Partial parameter sharing in the target decoder to preserve content while injecting style.

A dynamic training schedule that gradually shifts emphasis from reconstruction to adversarial objectives.
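
The following sketch shows one way the baseline pieces and the dynamic schedule from the last item can fit together. The dimensions, the toy one-token reconstruction target, and the loss weighting are all assumptions for illustration, not the production design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM = 32000, 512

class Encoder(nn.Module):
    """Maps a token sequence to a single latent vector that should carry
    content but not style."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.rnn = nn.GRU(DIM, DIM, batch_first=True)
    def forward(self, tokens):                     # (B, T) -> (B, DIM)
        _, h = self.rnn(self.embed(tokens))
        return h.squeeze(0)

encoder = Encoder()
decoders = {"source": nn.Linear(DIM, VOCAB),       # stand-ins for full
            "target": nn.Linear(DIM, VOCAB)}       # autoregressive decoders
discriminator = nn.Linear(DIM, 2)                  # predicts style from z

def discriminator_loss(tokens, style_id):
    z = encoder(tokens).detach()                   # do not update encoder here
    target = torch.full((tokens.size(0),), style_id)
    return F.cross_entropy(discriminator(z), target)

def encoder_decoder_loss(tokens, style_id, corrupt, step, total_steps):
    z = encoder(corrupt(tokens))                   # deletion / span masking
    dec = decoders["source" if style_id == 0 else "target"]
    # Toy reconstruction target (first token only); the real model would
    # reconstruct the whole corrupted sentence autoregressively.
    rec = F.cross_entropy(dec(z), tokens[:, 0])
    # The encoder is rewarded for CONFUSING the discriminator, so the
    # latent keeps content but sheds style.
    target = torch.full((tokens.size(0),), style_id)
    adv = -F.cross_entropy(discriminator(z), target)
    alpha = step / total_steps                     # dynamic schedule: shift
    return (1 - alpha) * rec + alpha * adv         # reconstruction -> adversarial
```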

Results

Compared with the baseline, the enhanced model achieves higher relevance scores and better automatic quality metrics (e.g., BLEU, ROUGE) while preserving the original content. Qualitative examples show clearer, more coherent style-adapted copy.

3. Detail‑Page Copy Extraction

Background: E-commerce product detail pages contain rich textual information spread across many images, with no explicit text or layout metadata. Extracting concise copy snippets from them is essential for downstream advertising tasks.

Method

The pipeline consists of four stages (a sketch of the clustering stage follows the list):

OCR: Use DuGuang OCR to obtain text blocks with bounding boxes.

Filtering: Apply rule-based heuristics to discard non-copy elements (e.g., navigation icons, decorative graphics).

Self-supervised clustering: Train a font-embedding model and compute cosine similarity between block embeddings. Merge adjacent blocks whose font similarity exceeds a threshold and whose spatial distance is below a configurable limit, forming semantic units.

Ordering & punctuation: Sort merged units by natural reading order (left-to-right, top-to-bottom) and feed the sequence into a BERT-based punctuation-prediction model to produce complete sentences.
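
To make the clustering stage concrete, here is a greedy sketch of the merge logic. The thresholds, the vertical-gap distance measure, and the data layout are illustrative assumptions, not the production values.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def merge_blocks(blocks, font_thresh=0.85, dist_thresh=40):
    """Merge neighboring OCR blocks whose font embeddings are similar AND
    which sit close together. Each block is a dict with 'text',
    'box' = (x0, y0, x1, y1) in pixels, and 'font_emb' (a numpy vector)."""
    # Natural reading order first: top-to-bottom, then left-to-right.
    blocks = sorted(blocks, key=lambda b: (b["box"][1], b["box"][0]))
    units = []
    for b in blocks:
        if units:
            prev = units[-1]
            gap = b["box"][1] - prev["box"][3]        # vertical gap in pixels
            if (cosine(prev["font_emb"], b["font_emb"]) > font_thresh
                    and 0 <= gap < dist_thresh):
                # Same semantic unit: concatenate text, extend the box.
                prev["text"] += b["text"]
                prev["box"] = (min(prev["box"][0], b["box"][0]),
                               prev["box"][1],
                               max(prev["box"][2], b["box"][2]),
                               b["box"][3])
                continue
        units.append(dict(b))
    # Each unit's text is then passed to the punctuation-prediction model.
    return units
```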

Candidate sentences are scored by a GPT-style language model (perplexity) to filter out incomplete or low-quality text. A BERT classifier, trained on more than 100k manually labeled examples covering styles such as "product selling point", assigns each candidate to a copy category.
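
A minimal sketch of the perplexity filter, written against the Hugging Face transformers API; the gpt2 checkpoint and the threshold are placeholders, since the article only says a GPT-style language model is used.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Placeholder checkpoint; the production system would use its own LM.
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(sentence: str) -> float:
    ids = tok(sentence, return_tensors="pt").input_ids
    # With labels == input_ids, the model returns the mean token NLL,
    # whose exponential is the perplexity.
    loss = lm(ids, labels=ids).loss
    return torch.exp(loss).item()

def filter_candidates(sentences, max_ppl=200.0):
    """Keep only candidates fluent enough for the downstream classifier;
    the threshold value is an illustrative assumption."""
    return [s for s in sentences if perplexity(s) < max_ppl]
```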

Experiment & Results

Evaluation on high‑traffic products (daily page views > 10) shows that the pipeline extracts concise, high‑quality copy snippets suitable for advertising. Precision and recall improvements over a naïve OCR‑only baseline exceed 20 %.

4. Video Long‑Form Copy Generation

Background: Video advertising requires longer-form narration copy that is coherent and factual. Small advertisers often lack the resources to author such copy.

Method

A hierarchical multimodal transformer [2] is employed:

Storyline decoder (lower layer) generates a sequence of keywords that outline the narrative.

Text decoder (upper layer) expands the keyword storyline into full sentences.

An auxiliary image‑attribute classification task balances contributions from visual inputs (product images), textual inputs (product title), and structured attribute tables (key‑value pairs), ensuring factual consistency and preventing hallucinated claims.

The model processes multimodal inputs jointly, with visual features extracted by a CNN, textual embeddings from the title, and attribute embeddings from a learned table encoder. Training uses tens of millions of product samples; the loss combines cross‑entropy for text generation, classification loss for the auxiliary task, and a factuality regularizer that penalizes generated statements not supported by the input attributes.
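
The combined objective can be sketched as follows. The module shapes, loss weights, and the form of the factuality regularizer are assumptions for illustration; the exact architecture is defined in [2].

```python
import torch
import torch.nn as nn

class HierarchicalCopyModel(nn.Module):
    """Schematic of the two-level generator: a storyline (keyword) head,
    a sentence-text head, and an auxiliary image-attribute classifier."""
    def __init__(self, vocab=32000, dim=512, num_attrs=1000):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, 8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, 4)  # fuses modalities
        self.storyline_head = nn.Linear(dim, vocab)     # lower layer: keywords
        self.text_head = nn.Linear(dim, vocab)          # upper layer: sentences
        self.attr_head = nn.Linear(dim, num_attrs)      # auxiliary task

    def forward(self, fused_inputs):
        # fused_inputs: (B, T, dim) concatenation of CNN image features,
        # title token embeddings, and attribute-table embeddings.
        h = self.encoder(fused_inputs)
        pooled = h.mean(dim=1)
        return (self.storyline_head(h),    # keyword logits per position
                self.text_head(h),         # sentence-token logits per position
                self.attr_head(pooled))    # image-attribute logits

def combined_loss(story_logits, text_logits, attr_logits,
                  story_tgt, text_tgt, attr_tgt, unsupported_mask,
                  w_attr=0.5, w_fact=0.1):
    ce = nn.functional.cross_entropy
    l_story = ce(story_logits.transpose(1, 2), story_tgt)  # keyword CE
    l_text = ce(text_logits.transpose(1, 2), text_tgt)     # sentence CE
    l_attr = ce(attr_logits, attr_tgt)                     # auxiliary CE
    # Factuality regularizer (assumed form): penalize probability mass on
    # tokens flagged as unsupported by the input attribute table.
    probs = text_logits.softmax(-1)
    l_fact = (probs * unsupported_mask).sum(-1).mean()
    return l_story + l_text + w_attr * l_attr + w_fact * l_fact
```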

Results

Compared with the previous baseline, the hierarchical model more than doubles the accuracy of generated copy (measured by human‑rated factuality and relevance). The system has been deployed in multiple video creation tools, delivering significant efficiency gains for advertisers.

References

[1] Yiqi Gao, Xinglin Hou, Yuanmeng Zhang, Tiezheng Ge, Yuning Jiang, Peng Wang. "CapOnImage: Context‑driven Dense‑Captioning on Image." https://arxiv.org/pdf/2204.12974.pdf

[2] Zhipeng Zhang, Xinglin Hou, Kai Niu, Zhongzhen Huang, Tiezheng Ge, Yuning Jiang, Qi Wu, Peng Wang. "Attract me to Buy: Advertisement Copywriting Generation with Multimodal Multi‑structured Information." https://arxiv.org/pdf/2205.03534.pdf

