
Multimodal Automatic Layout Generation for E-commerce

This project builds a multimodal automatic layout generation system for e‑commerce. The qwen‑vl‑7b vision‑language model is fine‑tuned with LoRA on open‑source poster data and internal Taobao image‑layout data to produce structured layouts. Two technical routes are considered: diffusion‑based image generation and direct coordinate prediction. The resulting layouts support poster, marketing image, and video‑cover creation, with over 90% adoption in video‑cover generation. Planned extensions include multi‑image layouts, style‑aware suggestions, and iterative refinement.

DaTaobao Tech

Mar 12, 2025

Background: the rapid growth of digital content creation drives demand for automated layout generation for posters, ads, and other visual assets. Multimodal techniques combine computer vision and natural language processing to produce layouts that integrate images and text.

Technical routes: (1) The image‑generation approach uses diffusion models to render a layout as an image, which is then parsed into structured data. (2) The coordinate‑prediction approach predicts normalized element coordinates directly, using either a diffusion model or a large language model (LLM), and outputs the layout as structured text.
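To make the second route concrete, the sketch below shows one possible way to turn pixel boxes into normalized layout text and to parse such text back. The article does not specify the output format, so the 0 to 1000 integer grid, the `label x1 y1 x2 y2` line format, and all names (`Element`, `to_layout_text`, `parse_layout_text`) are assumptions made for illustration.

```python
from dataclasses import dataclass

SCALE = 1000  # assumed integer grid; the article only says "normalized"


@dataclass(frozen=True)
class Element:
    label: str   # hypothetical role name, e.g. "title"
    box: tuple   # (x1, y1, x2, y2) in pixels


def normalize(box, width, height, scale=SCALE):
    """Map a pixel box to integer coordinates on a 0..scale grid."""
    x1, y1, x2, y2 = box
    return (
        round(x1 / width * scale),
        round(y1 / height * scale),
        round(x2 / width * scale),
        round(y2 / height * scale),
    )


def to_layout_text(elements, width, height):
    """Serialize elements as one 'label x1 y1 x2 y2' line each."""
    lines = []
    for el in elements:
        coords = " ".join(str(v) for v in normalize(el.box, width, height))
        lines.append(f"{el.label} {coords}")
    return "\n".join(lines)


def parse_layout_text(text, width, height, scale=SCALE):
    """Parse layout text back into pixel-space elements."""
    elements = []
    for line in text.strip().splitlines():
        label, *coords = line.split()
        x1, y1, x2, y2 = (int(v) for v in coords)
        elements.append(Element(label, (
            round(x1 / scale * width), round(y1 / scale * height),
            round(x2 / scale * width), round(y2 / scale * height),
        )))
    return elements
```

Normalizing to a fixed grid keeps the text targets independent of image resolution, so the same layout string applies to any canvas with the same aspect ratio.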

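Data & training: the multimodal large model qwen‑vl‑7b was fine‑tuned with LoRA using the DeepSpeed framework. Training data came from open‑source poster datasets and internal Taobao image‑layout data. Annotations were generated automatically with InternVL2; an example is shown below. The annotation uses Chinese field names: 文本 (text), 主标题 (main title), 卖点 (selling point), 文本语言 (text language, here 中文, Chinese), and 文本主体色调 (dominant text color, here 黑, black).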

{
  "文本": {
    "主标题": [
      {
        "ocr": "关键包品: 编织手法",
        "box": "[137, 126, 851, 207]",
        "文本语言": "中文",
        "文本主体色调": "黑"
      }
    ],
    "卖点": [
      {
        "ocr": "柔软皮革",
        ...
      }
    ]
  }
}
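An annotation of this shape can be flattened into training records with a few lines of Python. This is a minimal sketch, not the authors' pipeline: the helper name `flatten_annotation` is hypothetical, the sample contains only the one element shown in full above, and the box order `[x1, y1, x2, y2]` is an assumption, since the article does not state it.

```python
import json

# Hypothetical sample following the shape of the annotation above;
# only the fully shown element is included.
SAMPLE = """
{
  "文本": {
    "主标题": [
      {
        "ocr": "关键包品: 编织手法",
        "box": "[137, 126, 851, 207]",
        "文本语言": "中文",
        "文本主体色调": "黑"
      }
    ]
  }
}
"""


def flatten_annotation(raw):
    """Return a flat list of (role, ocr_text, box) tuples.

    The "box" field is a string in the example, so it is decoded a
    second time with json.loads to obtain four integers.
    """
    data = json.loads(raw)
    items = []
    for role, entries in data.get("文本", {}).items():
        for entry in entries:
            box = json.loads(entry["box"])
            if len(box) != 4:
                raise ValueError(f"expected 4 coordinates, got {box!r}")
            items.append((role, entry["ocr"], tuple(box)))
    return items
```

Validating the box length at load time catches malformed automatic annotations before they reach training.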

Model selection balances capability against inference cost, and qwen‑vl‑7b was chosen as the base model. To improve robustness, random noise and partial masking were applied to the layout boxes as data augmentation.
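The box augmentation could look roughly like the sketch below. The article gives no parameters, so the noise magnitude, the masking probability, the 0 to 1000 grid, and the use of `None` as a mask placeholder are all assumptions, and `augment_boxes` is a hypothetical name.

```python
import random


def augment_boxes(boxes, noise=0.02, mask_prob=0.3, scale=1000, rng=None):
    """Jitter each box and randomly mask some of them.

    boxes: list of (x1, y1, x2, y2) on a 0..scale grid.
    Returns a list of the same length; masked entries are None.
    """
    rng = rng or random.Random()
    max_shift = noise * scale  # largest shift applied to any coordinate
    out = []
    for box in boxes:
        if rng.random() < mask_prob:
            out.append(None)  # caller would render this as a mask token
            continue
        x1, y1, x2, y2 = [
            min(scale, max(0, round(v + rng.uniform(-max_shift, max_shift))))
            for v in box
        ]
        # Keep corners ordered in case noise flipped a very thin box.
        out.append((min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2)))
    return out
```

Passing a seeded `random.Random` instance as `rng` makes the augmentation reproducible across runs.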

Business applications: automatic layout is widely used in Taobao's content domain, where it powers poster generation, marketing image creation, and video cover design. It provides reference layouts for text‑to‑image models and adapts text placement so that text does not occlude key visual elements. In video‑cover generation, adoption exceeds 90%.
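The article does not describe how occlusion is avoided, and in the described system the model presumably learns placement from data. As an illustration of the underlying constraint only, the sketch below ranks candidate text boxes by how much they overlap the boxes of key visual elements. The function names and the idea of choosing among fixed candidates are assumptions.

```python
def intersection_area(a, b):
    """Overlap area of two (x1, y1, x2, y2) boxes; 0 if they are disjoint."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, w) * max(0, h)


def total_overlap(text_box, subject_boxes):
    """Summed overlap between one text box and all subject boxes.

    Overlapping subject boxes are counted more than once, which is
    acceptable for ranking candidates.
    """
    return sum(intersection_area(text_box, s) for s in subject_boxes)


def pick_text_box(candidates, subject_boxes):
    """Return the candidate that covers the least subject area.

    Ties go to the earliest candidate in the list.
    """
    return min(candidates, key=lambda c: total_overlap(c, subject_boxes))
```

For example, with a product occupying the center of a cover, a banner across the middle scores worse than one placed in the empty strip above it, so the top placement is selected.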

Future directions: multi‑image layout generation, personalized style‑aware layout suggestions, and iterative refinement with human feedback to enhance artistic quality.
