Uni-Layout: Unified Cross-Task Layout Generation with Human-Aligned Evaluation

Uni-Layout introduces a unified layout generation framework that consolidates diverse design tasks, leverages multimodal large language models for flexible generation, and aligns outputs with human perception through a novel human‑feedback dataset (Layout‑HF100k) and a dynamic margin preference optimization (DMPO) evaluator.

JD Retail Technology

Background and Motivation

Layout generation is essential for designing attractive visual compositions in e‑commerce, posters, UI, and magazines. Existing methods are task‑specific and rely on evaluation metrics that often diverge from human perception, limiting their applicability.

Unified Layout Generation (Uni-Layout)

To address these issues, Uni-Layout proposes a unified generator that can handle multiple layout tasks through a single multimodal large language model (MLLM). The generator accepts a natural‑language instruction describing the task, background (b) and element (e) constraints, and produces coherent layouts even when background or element content is missing.

The taxonomy organizes tasks along two dimensions: whether the background content is free or constrained, and whether the element content is free or constrained. This yields four representative types: BFEF (both free), BCEF (background constrained, elements free), BFEC (background free, elements constrained), and BCEC (both constrained). A generic instruction takes the form ⟨T, b, e, O⟩, where T is the task description, b and e encode background and element attributes (both slots are required but may be empty), and O specifies the output format.
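As a rough illustration, the ⟨T, b, e, O⟩ template can be assembled like this; the field labels, placeholder text, and `build_instruction` helper are assumptions for illustration, not the paper's exact prompt format.

```python
# Hypothetical sketch of the generic instruction template <T, b, e, O>.
# Field names and the "<free>" placeholder are illustrative assumptions.

def build_instruction(task: str, background: str = "", elements: str = "",
                      output_format: str = "JSON list of (label, x, y, w, h)") -> str:
    """Compose a unified layout-generation instruction.

    `background` and `elements` are required slots but may be empty,
    which distinguishes the four task types:
      BFEF: both empty          BCEF: background given, elements free
      BFEC: elements given      BCEC: both given
    """
    return (f"Task: {task}\n"
            f"Background: {background or '<free>'}\n"
            f"Elements: {elements or '<free>'}\n"
            f"Output: {output_format}")

# A BCEF-style instruction: background constrained, elements free.
prompt = build_instruction(
    task="Generate a poster layout",
    background="product photo on white backdrop",
)
```

Leaving a slot empty (rather than omitting it) keeps the prompt shape identical across all four task types, which is what lets a single MLLM handle them uniformly.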

Human‑Feedback Dataset (Layout‑HF100k)

Since existing benchmarks lack human judgments of layout quality, the authors compiled Layout‑HF100k, a 100k-example dataset with meticulous human annotations indicating whether each layout is acceptable. This dataset enables training and evaluating models that mimic human aesthetic judgment.

Human‑Like Evaluator with Chain‑of‑Thought (CoT)

The evaluator processes layouts through two branches—visual and geometric—to simulate human assessment. It outputs a confidence score and a qualitative reasoning chain consisting of four steps:

Layout Overview: A brief textual summary of the overall composition.

Spatial Deconstruction: Analysis of geometric properties, alignment, and spacing.

Aesthetic Evaluation: Assessment of visual balance, harmony, and design principles.

Comprehensive Judgment: Final “qualified” or “unqualified” decision.

This CoT mechanism provides interpretable explanations akin to human experts.
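The four-step chain above lends itself to simple structured parsing. The sketch below splits an evaluator response on the step headings; the flat "Heading: text" output format is an assumption, not the evaluator's documented interface.

```python
# Hedged sketch: parse a four-step CoT evaluation into a structured result.
# The step names mirror the reasoning chain; the text format is assumed.
import re

STEPS = ["Layout Overview", "Spatial Deconstruction",
         "Aesthetic Evaluation", "Comprehensive Judgment"]

def parse_cot(text: str) -> dict:
    """Split evaluator output on the four step headings."""
    result = {}
    for i, step in enumerate(STEPS):
        nxt = STEPS[i + 1] if i + 1 < len(STEPS) else None
        # Capture everything between this heading and the next (or end of text).
        stop = re.escape(nxt) + ":" if nxt else r"$"
        m = re.search(re.escape(step) + r":\s*(.*?)(?=" + stop + r")", text, re.S)
        result[step] = m.group(1).strip() if m else ""
    # Final decision: "qualified" unless the judgment says "unqualified".
    result["qualified"] = "unqualified" not in result["Comprehensive Judgment"].lower()
    return result

sample = ("Layout Overview: balanced two-column poster. "
          "Spatial Deconstruction: elements grid-aligned. "
          "Aesthetic Evaluation: good contrast. "
          "Comprehensive Judgment: qualified.")
verdict = parse_cot(sample)
```

Parsing the chain rather than only the final verdict preserves the interpretable explanations for downstream display or auditing.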

Dynamic Margin Preference Optimization (DMPO)

Traditional alignment methods treat all human preferences equally. DMPO adapts the margin between candidate layouts based on the strength of human preference: stronger preferences increase the margin, weaker ones use a smaller margin. The score difference between two candidates l_1 and l_2 is computed as:

[Figure: score difference formula]

The final DMPO loss applies a nonlinear transformation f(·) to the margin‑adjusted score difference, encouraging the generator to produce layouts that match nuanced human preferences more closely.
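A plausible form of this objective, assuming a DPO-style formulation (the symbols β, π_θ, π_ref, and the margin function m(s) are assumptions based on that family of methods, not the paper's exact equations):

```latex
% Implicit score difference between preferred l_1 and rejected l_2 given input x,
% assuming a DPO-style policy/reference ratio:
\Delta(l_1, l_2) \;=\; \beta \left[
    \log \frac{\pi_\theta(l_1 \mid x)}{\pi_{\mathrm{ref}}(l_1 \mid x)}
  \;-\; \log \frac{\pi_\theta(l_2 \mid x)}{\pi_{\mathrm{ref}}(l_2 \mid x)} \right]

% Dynamic margin m(s) grows with preference strength s; f is the
% nonlinear transformation (e.g. -\log\sigma(\cdot) in DPO):
\mathcal{L}_{\mathrm{DMPO}} \;=\; \mathbb{E}\Big[\, f\big( \Delta(l_1, l_2) - m(s) \big) \,\Big]
```

Under this reading, a strong human preference (large s) widens the margin m(s), forcing a larger score gap before the pair stops contributing to the loss.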

Experiments

Evaluation Model Performance

The proposed evaluator was compared against leading closed‑source multimodal LLMs (GPT‑4o, Claude‑3.5 Sonnet, GLM‑4v, DeepSeek‑R1) using an “LLM‑as‑Judge” protocol. Uni‑Layout’s evaluator achieved 85.5% accuracy, surpassing the baselines by 25‑35%.

[Table: evaluation model comparison]

Layout Generation Model Performance

Uni‑Layout was benchmarked against task‑specific SOTA models (e.g., LayoutDM), closed‑source models (GPT‑4o, Claude‑3.5, DeepSeek‑R1), and open‑source multimodal LLMs (LLaVA). Across four task categories, Uni‑Layout achieved the lowest Overlap (Ove) and Alignment (Ali) errors and the highest recall/completeness metrics, setting new records on BFEC and BCEC tasks.

[Table: generation model metrics]

Human Simulation Evaluation

Using the Layout‑HF100k dataset, the authors measured the LR (likelihood‑ratio) score of their evaluator. Uni‑Layout attained the highest LR of 0.702, outperforming GPT‑4o (0.584), Claude‑3.5 (0.575), DeepSeek‑R1 (0.401), and the open‑source baseline LLaVA (0.422). Compared with the best existing methods (LayoutFlow, P&R, Poster‑Llama), which average 0.658, Uni‑Layout shows a clear advantage.

[Figure: human simulation LR scores]

Overall, Uni‑Layout demonstrates that a unified generation‑evaluation pipeline, enriched with large‑scale human feedback and adaptive alignment, can substantially improve layout quality and better reflect human aesthetic judgments.

Tags: evaluation, layout generation, multimodal LLM, human feedback, ACM Multimedia, dynamic margin optimization
Written by

JD Retail Technology

Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.
