How UI-UG Unifies UI Understanding and Generation with a 7B Multimodal Model

The open‑source UI‑UG‑7B multimodal model from Alipay combines UI understanding and generation in a single framework, delivering state‑of‑the‑art performance across referring, grounding, captioning, and code generation tasks while dramatically speeding up UI creation for developers.

Alipay Experience Technology

Introduction

In the mobile‑internet era, the user interface (UI) is the bridge connecting users to the digital world, directly shaping both user experience and business conversion.

Recent work such as Apple's Ferret‑UI and Microsoft's OmniParser, along with papers like Web2Code and DCGen, focuses on either UI understanding or UI generation, but not both.

The Alipay Experience Technology Department has released UI‑UG, the first open‑source multimodal model to unify UI understanding and generation, achieving strong performance in both scenarios.

UI‑UG‑7B is now open‑sourced:

HuggingFace: https://huggingface.co/neovateai/UI-UG-7B

GitHub: https://github.com/neovateai/UI-UG

Paper: https://arxiv.org/abs/2509.24361

Model Overview: A New Paradigm for UI Intelligence

UI Understanding and Generation: Twin Tasks

The core idea is that understanding and generation are complementary; joint training enables information sharing that improves both tasks.

Four Core Capabilities of UI‑UG

UI‑UG covers Referring, Grounding, Captioning, and Generation.

Referring: Precisely describe a specified image region, identifying UI element types (buttons, icons, dialogs), extracting OCR text, recognizing colors, and so on.

Grounding: Detect all UI elements in an image, locate specific types, and recognize basic components (text, images, icons) as well as interactive elements (close button, back button).
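
Conceptually, the grounding output is a list of typed, localized detections. A minimal sketch of how downstream code might consume it; the element schema, type names, and field names here are illustrative assumptions, not the model's actual output format:

```python
# Hypothetical grounding output: each detection carries a UI element
# type and a pixel-space bounding box [x1, y1, x2, y2].
detections = [
    {"type": "text", "box": [24, 40, 300, 72]},
    {"type": "icon", "box": [330, 36, 366, 72]},
    {"type": "close_button", "box": [952, 20, 1000, 68]},
]

# Assumed set of interactive element types for this sketch.
INTERACTIVE_TYPES = {"close_button", "back_button", "checkbox"}

def interactive_elements(dets):
    """Filter detections down to interactive UI elements."""
    return [d for d in dets if d["type"] in INTERACTIVE_TYPES]

print(interactive_elements(detections))
```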

Captioning: Produce a structured description of the entire UI image following a predefined format, providing a foundation for downstream analysis and generation.

Generation: Generate UI DSL code from a textual description and a reference image, with support for dynamic data binding and progressive rendering; in internal tests, card‑level UI generation completes in as little as 5 seconds.

Technical Breakthroughs: Data, Training, and Optimization

The UI‑UG workflow consists of extensive data preparation followed by a two‑stage training process, Supervised Fine‑Tuning (SFT) and then Reinforcement Learning (RL), with innovations at each stage.

Data Construction: Diverse Real‑World Dataset

UI‑UG abandons outdated open‑source datasets and builds a new collection of over 30,000 UI pages from various business scenarios, de‑duplicated via visual embeddings to ensure diversity. It expands UI type definitions to include pop‑ups, checkboxes, and other high‑frequency components, and uses a CV model for pre‑labeling followed by human verification.
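De‑duplication by visual embedding can be sketched as a greedy filter: keep a page only if its embedding is sufficiently dissimilar from every page already kept. This is a minimal illustration of the idea, not the actual pipeline; the similarity threshold is an assumed value:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def dedupe(embeddings, threshold=0.95):
    """Greedily keep an embedding only if it is not too similar to any
    already-kept one; returns indices of the pages that survive."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Two near-identical pages and one distinct page.
embs = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]]
print(dedupe(embs))  # → [0, 2]
```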

For generation, UI‑UG builds a dedicated pipeline: pages are sliced into card‑level components, and a powerful multimodal model (e.g., Qwen2.5‑VL‑72B) generates component descriptions, simulates user requirements, and produces the corresponding UI. The approach targets a UI DSL rather than end‑to‑end code, and applies text‑erasure and style‑mixing techniques to strengthen style transfer.

Two‑Stage Training: From Basic Learning to Precise Optimization

SFT stage: Based on Qwen2.5‑VL‑7B, the ViT module is frozen while the LLM and the vision‑language adapter are trained on 180k VQA examples for three epochs on eight A100 GPUs.
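
Freezing the ViT while training the LLM and adapter amounts to partitioning parameters by module. A hedged sketch of that selection logic, assuming Qwen2.5‑VL‑style parameter naming where the vision tower lives under `visual.` and the vision‑language adapter under `visual.merger.` (the prefixes are an assumption about the checkpoint layout):

```python
def is_trainable(name):
    """Decide whether a named parameter is updated during SFT:
    train the LLM and the vision-language adapter, freeze the ViT."""
    if name.startswith("visual.merger."):  # vision-language adapter
        return True
    if name.startswith("visual."):         # ViT encoder: frozen
        return False
    return True                            # LLM and everything else

params = [
    "visual.blocks.0.attn.qkv.weight",        # ViT: frozen
    "visual.merger.mlp.0.weight",             # adapter: trained
    "model.layers.0.self_attn.q_proj.weight", # LLM: trained
]
print([p for p in params if is_trainable(p)])
```

In a real PyTorch training loop this predicate would set `requires_grad` per parameter before building the optimizer.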

Reinforcement Learning stage: Each task is improved with a specialized method:

Referring: GRPO focuses on eight hard‑sample component classes, improving classification accuracy and format compliance.

Grounding: GRPO with a dual‑IoU reward optimizes both recall and precision.

Generation: DPO uses 8,000 preference pairs covering visual structure, color aesthetics, textual consistency, and interactivity, markedly improving generation quality.

The final model undergoes sequential optimization: DPO (generation) → GRPO (referring) → GRPO (grounding).
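
The dual‑IoU idea for grounding can be sketched as a reward with two sides: a recall term that scores how well each ground‑truth box is covered by some prediction, and a precision term that scores how well each prediction matches some ground truth, so both missed and spurious boxes are penalized. This is an illustrative reconstruction, not the paper's exact reward:

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def dual_iou_reward(preds, gts):
    """Average of a recall-side term (each GT's best IoU over the
    predictions) and a precision-side term (each prediction's best
    IoU over the GTs)."""
    if not preds or not gts:
        return 0.0
    recall = sum(max(iou(g, p) for p in preds) for g in gts) / len(gts)
    precision = sum(max(iou(p, g) for g in gts) for p in preds) / len(preds)
    return (recall + precision) / 2
```

A spurious extra box leaves the recall term untouched but drags the precision term down, which is exactly the behavior a recall‑only IoU reward lacks.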

Experimental Validation: Leading the Vertical Domain

Benchmarks on UI understanding and UI generation compare UI‑UG against general multimodal models (GPT‑4o, Claude 3.7 Sonnet, Gemini 2.5 Pro, Qwen2.5‑VL series) and specialized UI models (Apple’s Ferret‑UI2, Microsoft’s OmniParser V2), plus ablation studies on data preparation, SFT strategies, and RL impact.

UI Understanding Leads the Field

In the Referring task, UI‑UG achieves state‑of‑the‑art accuracy for element classification, OCR, and color recognition, surpassing GPT‑4o, Claude 3.7, Gemini 2.5, and even the specialized Ferret‑UI2, while its detection mAP exceeds OmniParser V2 and other visual models.

The advantage over closed‑source general models is especially pronounced, as they struggle with region awareness and rare element detection.

In Grounding, UI‑UG also reaches SOTA, with RL boosting mAP by 4.6% and markedly improving bounding‑box precision.

UI Generation Balances Efficiency and Quality

Across six evaluation dimensions (format correctness, visual similarity, and four UI‑generation metrics), UI‑UG ranks first in format stability, ensuring industrial‑grade usability.

DPO training raises the generation quality score by 14.5% (36.7 → 42.02).
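
Behind that gain is the standard DPO objective over preference pairs: push the policy's log‑ratio for the chosen response above the rejected one, both measured against a frozen reference model. A minimal sketch for a single pair (the log‑probability values in the test are hypothetical):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_w / logp_l: policy log-probs of the chosen (w) and rejected (l)
    responses; ref_logp_*: the same under the frozen reference model;
    beta: the usual DPO temperature.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1 / (1 + math.exp(-margin)))  # -log(sigmoid(margin))
```

When the policy matches the reference, the margin is zero and the loss is log 2; the loss falls as the policy favors the chosen response more than the reference does.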

Generation quality approaches that of the source model Qwen2.5‑VL‑72B while being 4–6× faster.

Deployed on two NVIDIA L20 GPUs, the model's average response time is 5–6 seconds, compared with 20–30 seconds for general‑purpose models.

Key Insight: Value of Joint Training

Ablation studies confirm the value of the design choices:

Adapter + LLM fine‑tuning yields the best balance.

Mixed training data improves both understanding and generation tasks.

GRPO/DPO bring clear metric gains for their respective tasks.

Future Outlook: Continuous Evolution

Planned directions include:

Finer granularity: hierarchical components, composite layouts, interactive states.

Lighter models: exploring smaller architectures and quantization for near‑real‑time "one‑sentence UI generation".

Deeper understanding: captioned grounding and multi‑turn conversational editing.

Open collaboration: the project is open‑source and invites community contributions.

As the first unified UI understanding and generation model, UI‑UG provides an AI‑for‑Frontend tool and opens new possibilities for GUI agents and conversational UI generation, paving the way for smarter human‑computer interaction.

Tags: Artificial Intelligence, UI Generation, Multimodal Model, UI Understanding