How Multimodal Large Models Can Auto-Generate UI Test Cases End‑to‑End

Leveraging multimodal large-model AI, this article outlines a four-stage evolution from text-based UI element identification to fully autonomous, end-to-end generation of executable UI automation scripts, detailing the system architecture, the intelligent reasoning engine, and real-world results from a Ctrip hotel refund test case.

Ctrip Technology

Introduction

Traditional UI automation testing requires testers to set up a full development environment, manually locate UI elements, and write test code based on low‑level framework methods. This process is technically demanding, time‑consuming, and error‑prone, especially for complex interfaces. Recent advances in multimodal large models, which can process text, images, and structural data simultaneously, offer a promising solution for intelligent UI test generation.

Evolution of Intelligent UI Automation Test Case Generation

Stage 1: Text‑Attribute Exploration

Early work combined large models with visible text attributes of UI controls (button labels, hints, etc.) to generate test cases. While simple, this approach struggled with dynamic content, non‑textual controls, and lacked robustness.
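As a rough sketch of this stage (the function name and control schema below are illustrative, not the production code), the prompt to a text-only model might be assembled from nothing but visible labels and hints, which is exactly where non-textual controls fall through:

```python
# Illustrative Stage 1 sketch: build an LLM prompt purely from visible
# text attributes. Controls without any text surface as "<no text>",
# which is why this approach struggles with icons and dynamic content.
def generate_prompt(controls: list[dict], goal: str) -> str:
    """Flatten visible text attributes (label, hint) into a prompt."""
    lines = []
    for c in controls:
        label = c.get("label") or c.get("hint") or "<no text>"
        lines.append(f"- {c['type']}: {label}")
    return (
        f"Test goal: {goal}\n"
        "Visible controls:\n" + "\n".join(lines) +
        "\nGenerate test steps that reference controls by their text."
    )

prompt = generate_prompt(
    [{"type": "button", "label": "Apply for refund"},
     {"type": "icon", "label": None, "hint": None}],
    "verify refund flow",
)
```

Note how the icon, having no text attribute at all, becomes unidentifiable to the model.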

Stage 2: Unique Identifier‑Based Standardization

To improve stability, a unique ID was assigned to each UI control and managed centrally, enhancing element locating accuracy and test case reliability. However, maintaining IDs introduced significant manual overhead.
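A minimal sketch of such a registry, assuming hypothetical IDs and locator values: stable IDs decouple tests from volatile on-screen text, but every entry has to be curated by hand:

```python
# Illustrative Stage 2 sketch: a centrally managed registry maps stable
# element IDs to locators. IDs and locator strings here are assumptions.
ELEMENT_REGISTRY = {
    "refund.apply_button": {"page": "order_detail", "locator": "btn_apply_refund"},
    "refund.amount_label": {"page": "refund_form", "locator": "txt_refund_amount"},
}

def locate(element_id: str) -> str:
    """Resolve a stable element ID to its current locator."""
    try:
        return ELEMENT_REGISTRY[element_id]["locator"]
    except KeyError:
        # Unregistered elements require a manual registry update --
        # the maintenance overhead this stage introduced.
        raise KeyError(f"Unregistered element ID: {element_id}") from None
```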

Stage 3: Multimodal Information Fusion for Smart Locating

With the rise of multimodal models, visual, structural, and semantic information are jointly processed, enabling precise element recognition without relying on manual IDs. This reduces maintenance cost and improves handling of complex, dynamic UIs, though detailed natural‑language test descriptions are still required.
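One way to picture this fusion (the payload shape below follows the common "content parts" convention and is an assumption, not a specific vendor API) is a single request that carries the screenshot, the DOM structure, and the semantic instruction together:

```python
import base64
import json

# Illustrative Stage 3 sketch: fuse visual (screenshot), structural (DOM),
# and semantic (instruction) inputs into one multimodal request.
# The field names are assumptions, not a particular model provider's API.
def build_multimodal_request(screenshot: bytes, dom: dict, instruction: str) -> dict:
    return {
        "content": [
            {"type": "text", "text": instruction},
            {"type": "text", "text": "DOM: " + json.dumps(dom, ensure_ascii=False)},
            {"type": "image", "data": base64.b64encode(screenshot).decode("ascii")},
        ]
    }
```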

Stage 4: End‑to‑End Autonomous Generation

The current stage implements a full pipeline that transforms high‑level natural‑language descriptions and page screenshots into executable test code. An autonomous reasoning engine plans test steps, a dynamic execution engine runs them, and an adaptive debugging mechanism iteratively refines failures.
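The plan-execute-refine loop described above can be sketched as follows; `plan`, `execute`, and `refine` are stand-ins for the reasoning engine, execution engine, and adaptive debugging mechanism, not the actual system internals:

```python
# Minimal sketch of the Stage 4 control loop: plan steps from the
# description, run each step, and let a debugging pass refine failures.
def generate_test(description: str, plan, execute, refine, max_retries: int = 2):
    steps = plan(description)              # reasoning engine: NL -> steps
    passed = []
    for step in steps:
        attempt = step
        for _ in range(max_retries + 1):
            if execute(attempt):           # dynamic execution engine
                passed.append(attempt)
                break
            attempt = refine(attempt)      # adaptive debugging refines the step
        else:
            raise RuntimeError(f"Step could not be repaired: {step}")
    return passed
```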

Evolution stages diagram

System Architecture of Multimodal Large‑Model UI Automation Generation

The system follows a five‑layer design: User Interaction Layer, API Service Layer, AI Core Layer, Execution Engine Layer, and Infrastructure Layer. Each layer communicates via standardized interfaces to ensure scalability and maintainability.

User Interaction Layer: Provides a friendly front-end for entering target URLs, natural-language test descriptions, and visualizing generated results.

API Service Layer: Bridges the front-end with AI processing, handling code-generation requests and real-time communication with the multimodal model.

AI Core Layer: Acts as the intelligent brain, using multimodal reasoning to convert high-level descriptions and screenshots into step-by-step actions, then generating Python test code via templated, parameterized strategies. A reinforcement-learning feedback loop optimizes failed steps.

Execution Engine Layer: Executes the generated code, performing precise multimodal element locating, real-time result verification, and adaptive retries. Execution feedback is sent back to the AI core for dynamic adjustments.

Infrastructure Layer: Supplies stable runtime support based on the company's internal framework.
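The Execution Engine Layer's adaptive locating and retry behavior might look like the following sketch, where the strategy callables (visual, structural, semantic) are hypothetical stubs and the returned failure list models the feedback sent back to the AI core:

```python
import time

# Sketch of adaptive element locating with fallback strategies and retry
# rounds. Strategy callables here are stand-ins; a real strategy would query
# the screen or DOM and return an element handle, or None on a miss.
def find_with_fallback(strategies, target: str, rounds: int = 3, delay: float = 0.0):
    failures = []
    for _ in range(rounds):
        for name, strategy in strategies:
            element = strategy(target)
            if element is not None:
                return element, failures   # feedback: which strategies missed
            failures.append(name)
        time.sleep(delay)                  # brief backoff before the next round
    raise LookupError(f"Element not found after {rounds} rounds: {target}")
```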

System architecture diagram
Intelligent generation workflow

Practice Case: Ctrip Hotel Refund

The case demonstrates the end‑to‑end generation of a test script for a hotel order refund scenario, which involves rich UI elements, complex interactions, and strict data validation.

Natural-language description:

User order is non‑cancellable.

User navigates to the refund application page.

Page shows refund amount with a notice "If approved, the money will be returned via the original route".

User clicks the icon next to the refund amount, triggering a refund‑detail overlay.

Generated Python code:

def runTest(self):
    """Initial task description:
    Given the user's order has a non-cancellable policy
    When the user enters the refund application page
    Then the page shows the refund amount, with the notice "若审核成功,钱款原路退回" (if approved, the money is returned via the original route) below it
    And the user clicks the icon to the right of the refund amount
    Then the refund-detail overlay pops up
    """
    # Click the "negotiated cancellation" button (协商取消按钮)
    self.click(self.findAnyElement("协商取消按钮"), index=0)
    # Click the "go apply" button (去申请按钮)
    self.click(self.findAnyElement("去申请按钮"), index=0)
    # Select the cancellation reason "itinerary changed / booked wrong" (行程改变/订错)
    self.click(self.findAnyElement("行程改变/订错"), index=0)
    # Click the "confirm" button (确认)
    self.click(self.findAnyElement("确认"), index=0)
    # Assert the page shows the refund amount
    self.assert_exist(self.findAnyElement("申请退款¥8840"), description="页面展示退款金额")
    # Assert the page shows the notice "若审核成功,钱款原路退回"
    self.assert_exist(self.findAnyElement("若申请成功,钱款原路退回"), description="页面展示提示“若审核成功,钱款原路退回”")
    # Click the icon to the right of the refund amount
    self.click(self.findAnyElement("退款金额信息icon"), index=0)
    # Assert the refund-detail overlay pops up
    self.assert_exist(self.findAnyElement("退款明细浮层"), description="弹出退款明细浮层")

After integration, the team generated over 7,000 UI test cases with an 80%+ success rate and uncovered more than 300 front‑end defects in the first half of 2025, significantly improving system stability.

Future Planning

Future work will focus on further improving success rates, stability, and cost efficiency, while inviting the community to share experiences and collaborate on advancing AI‑driven UI automation testing.
