Multimodal Automatic Layout Generation for E-commerce
This project builds a multimodal automatic layout generation system for e‑commerce. The qwen‑vl‑7b vision‑language model is fine‑tuned with LoRA on open‑source poster data and internal Taobao image‑layout data, and two technical routes are considered: diffusion‑based image generation and direct coordinate prediction. The resulting structured layouts power poster, marketing image, and video‑cover creation, with over 90% adoption in video‑cover generation. Planned extensions include multi‑image layouts, style‑aware suggestions, and iterative refinement.
Background: rapid growth of digital content creation demands automated layout generation for posters, ads, and other visual assets. Multimodal techniques combine computer vision and natural language processing to produce layouts that integrate images and text.
Technical routes: (1) The image‑generation approach uses diffusion models to create layout images, then parses them into structured data. (2) The coordinate‑prediction approach directly predicts normalized element coordinates with diffusion models or large language models (LLMs) and outputs the layout as structured text.
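To make the coordinate‑prediction route concrete, here is a minimal sketch of how a layout could be turned into structured text and parsed back. It assumes pixel boxes in [x1, y1, x2, y2] form and a 0 to 1000 integer grid for normalization; the source only says that coordinates are normalized, so the grid size, the one‑line‑per‑element text format, and all function names are illustrative assumptions rather than the project's actual format.

```python
import re

GRID = 1000  # assumed resolution of the normalized coordinate grid


def normalize_box(box, width, height, grid=GRID):
    """Map a pixel box [x1, y1, x2, y2] onto an integer grid of size `grid`."""
    x1, y1, x2, y2 = box
    return [
        round(x1 / width * grid),
        round(y1 / height * grid),
        round(x2 / width * grid),
        round(y2 / height * grid),
    ]


def layout_to_text(elements, width, height):
    """Serialize (label, pixel_box) pairs as one 'label: [x1, y1, x2, y2]' line each."""
    lines = []
    for label, box in elements:
        x1, y1, x2, y2 = normalize_box(box, width, height)
        lines.append(f"{label}: [{x1}, {y1}, {x2}, {y2}]")
    return "\n".join(lines)


# Hypothetical line format matching layout_to_text above.
_LINE = re.compile(r"^(.+?): \[(\d+), (\d+), (\d+), (\d+)\]$")


def text_to_layout(text):
    """Parse the text format back into (label, normalized_box) pairs."""
    elements = []
    for line in text.splitlines():
        match = _LINE.match(line.strip())
        if match:
            label = match.group(1)
            box = [int(v) for v in match.groups()[1:]]
            elements.append((label, box))
    return elements
```

A model trained on such text can be prompted with the image and asked to emit these lines, which are then parsed back into boxes. Lines that do not match the expected pattern are skipped, which gives a simple guard against malformed model output.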
Data & training: A multimodal large model (qwen‑vl‑7b) was fine‑tuned with LoRA under the deepspeed framework. Training data were gathered from open‑source poster datasets and internal Taobao image‑layout data. Automatic annotation was performed with internvl2; an example annotation is shown below.
{
  "文本": {
    "主标题": [
      {
        "ocr": "关键包品: 编织手法",
        "box": "[137, 126, 851, 207]",
        "文本语言": "中文",
        "文本主体色调": "黑"
      }
    ],
    "卖点": [
      {
        "ocr": "柔软皮革",
        ...
      }
    ]
  }
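}
In this annotation, 文本 is the text group, 主标题 the main title, 卖点 the selling points, 文本语言 the text language (中文, Chinese), and 文本主体色调 the dominant text color (黑, black).
Model selection balances capability against inference cost, so qwen‑vl‑7b was chosen as the base model. During training, random noise and partial masking are applied to the layout boxes as augmentation to improve robustness.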
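The box augmentation can be sketched roughly as follows. This is one possible reading, not the project's implementation: it assumes boxes live on a 0 to 1000 grid, that "random noise" means a small uniform jitter on each coordinate, and that "partial mask" means hiding a random subset of boxes. The function name and the `noise` and `mask_prob` parameters are hypothetical.

```python
import random


def augment_boxes(boxes, noise=0.02, mask_prob=0.15, grid=1000, rng=None):
    """Jitter box coordinates and randomly mask whole boxes.

    boxes: list of [x1, y1, x2, y2] on a 0..grid integer grid.
    Returns a list of the same length; masked boxes are replaced by None.
    """
    rng = rng or random.Random()
    max_shift = noise * grid  # largest jitter applied to any coordinate
    out = []
    for box in boxes:
        if rng.random() < mask_prob:
            out.append(None)  # assumed meaning of "partial mask": hide this box
            continue
        # Add uniform noise to each coordinate and clamp to the grid.
        jittered = [
            min(grid, max(0, round(v + rng.uniform(-max_shift, max_shift))))
            for v in box
        ]
        x1, y1, x2, y2 = jittered
        # Keep corners ordered so the box stays valid after jitter.
        out.append([min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2)])
    return out
```

Passing a seeded `random.Random` as `rng` makes the augmentation reproducible, which helps when comparing training runs.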
Business applications: automatic layout technology is widely used in Taobao’s content domain, powering poster generation, marketing image creation, and video cover design. It provides reference layouts for text‑to‑image models and adapts text placement to avoid occluding key visual elements, achieving over 90% adoption in video‑cover generation.
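As an illustration of occlusion‑aware text placement, the sketch below scores candidate text boxes by how much they overlap key visual regions and picks the one with the least overlap. The source does not describe how the system does this; the scoring rule, the function names, and the idea of choosing from a fixed candidate list are assumptions made for the example.

```python
def intersection_area(a, b):
    """Overlap area of two boxes given as [x1, y1, x2, y2]."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, w) * max(0, h)


def occlusion_score(text_box, key_boxes):
    """Total area of key regions covered by the text box.

    Overlapping key regions are counted once each, so this is a simple
    upper bound on the occluded area rather than an exact union.
    """
    return sum(intersection_area(text_box, k) for k in key_boxes)


def pick_text_box(candidates, key_boxes):
    """Return the candidate text box that covers the least key-region area."""
    return min(candidates, key=lambda c: occlusion_score(c, key_boxes))
```

For example, with a product region at `[300, 200, 700, 800]`, a candidate banner at `[100, 50, 900, 150]` has no overlap and would be preferred over a candidate at `[350, 400, 650, 480]` that sits on top of the product.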
Future directions: multi‑image layout generation, personalized style‑aware layout suggestions, and iterative refinement with human feedback to enhance artistic quality.
DaTaobao Tech
Official account of DaTaobao Technology