How Midscene.js Leverages Multimodal AI for Zero‑Code UI Automation

Midscene.js, an open‑source UI automation framework from ByteDance’s Web Infra team, combines multimodal AI inference with a Chrome extension, YAML scripts, and a JavaScript SDK. It enables zero‑code testing on Web and Android, integrates with Playwright and Puppeteer, and exposes three core interfaces for actions, queries, and assertions.

Software Development Quality

Jun 10, 2025

Project Overview

Midscene.js is an open‑source UI automation tool released by ByteDance’s Web Infra team. It uses multimodal AI inference to let developers build UI automation projects quickly. It targets Web and Android, and it integrates with Playwright, Puppeteer, and other tools.

Zero‑Code Experience with Chrome Extension

Before writing any code, you can try the Chrome extension version of Midscene.js. The extension provides the core interfaces for interaction, data extraction and assertions, allowing you to run a scenario without writing a single line of code.

Install the Midscene extension and configure the AI service key.

Open the target shopping website.

Enter interaction commands in the extension, click “Run”, and view the execution result and playback animation.

Click the “Report File” button to obtain a complete replay file that records all steps and AI reasoning, which can be reused in later runs.

In the Query panel you can extract JSON data from the UI by describing the desired content and format, then clicking Run.

In the Assert panel you can describe the expected state of the page in natural language and let the AI verify it.

Three Core Interfaces

.ai / .aiAction – describe steps and execute interactions.

.aiQuery – understand the UI and extract data as JSON.

.aiAssert – perform assertions.
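
To make the division of labor concrete, here is a minimal sketch of a flow that uses all three interfaces. The method names come from the list above, but the signatures are assumptions, and RecordingAgent is a hypothetical stand-in that records calls and returns canned data instead of talking to a model or a browser. It is not the Midscene SDK.

```typescript
// Assumed shape of an agent exposing the three interfaces.
// Method names follow the article; parameter and return types are guesses.
interface UiAgent {
  aiAction(steps: string): Promise<void>;
  aiQuery<T>(dataDemand: string): Promise<T>;
  aiAssert(assertion: string): Promise<void>;
}

type Call = { method: "aiAction" | "aiQuery" | "aiAssert"; prompt: string };

// Hypothetical test double: no model, no browser, just a call log.
class RecordingAgent implements UiAgent {
  readonly calls: Call[] = [];
  private readonly cannedQueryResult: unknown;

  constructor(cannedQueryResult: unknown) {
    this.cannedQueryResult = cannedQueryResult;
  }

  async aiAction(steps: string): Promise<void> {
    this.calls.push({ method: "aiAction", prompt: steps });
  }

  async aiQuery<T>(dataDemand: string): Promise<T> {
    this.calls.push({ method: "aiQuery", prompt: dataDemand });
    return this.cannedQueryResult as T;
  }

  async aiAssert(assertion: string): Promise<void> {
    this.calls.push({ method: "aiAssert", prompt: assertion });
  }
}

interface Product {
  name: string;
  price: number;
}

// A typical flow: interact, extract structured data, then assert.
async function searchHeadphones(agent: UiAgent): Promise<Product[]> {
  await agent.aiAction('type "headphones" in the search box and press Enter');
  const products = await agent.aiQuery<Product[]>(
    "{name: string, price: number}[], the products in the result list"
  );
  await agent.aiAssert("the result list contains at least one product");
  return products;
}
```

Because searchHeadphones depends only on the UiAgent shape, the same flow could be handed a real agent later, provided its methods match these assumed signatures.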

Integration Options

YAML Scripts

YAML scripts are easy to read and do not require a large test project, making them suitable for simple verification scenarios.

After setting environment variables, the script can be executed with a single command.
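
As a rough illustration of what such a script contains, the sketch below builds a small task list and renders it as YAML text. The key names (web, url, tasks, name, flow, ai, aiQuery, aiAssert) are assumptions recalled from the Midscene documentation and may differ between versions. The toYaml helper is hypothetical and not part of Midscene.

```typescript
// Hypothetical types describing a script: one target page plus named tasks.
type StepKind = "ai" | "aiQuery" | "aiAssert";

interface FlowStep {
  kind: StepKind;
  prompt: string;
}

interface ScriptTask {
  name: string;
  flow: FlowStep[];
}

interface YamlScript {
  web: { url: string };
  tasks: ScriptTask[];
}

// Render the script as YAML. Strings are emitted as JSON string literals,
// which are also valid double-quoted YAML scalars, so quoting stays safe.
function toYaml(script: YamlScript): string {
  const quote = (s: string): string => JSON.stringify(s);
  const lines: string[] = ["web:", "  url: " + quote(script.web.url), "tasks:"];
  for (const task of script.tasks) {
    lines.push("  - name: " + quote(task.name));
    lines.push("    flow:");
    for (const step of task.flow) {
      lines.push("      - " + step.kind + ": " + quote(step.prompt));
    }
  }
  return lines.join("\n") + "\n";
}

// Example script: one task that searches and then checks the result.
const exampleScript: YamlScript = {
  web: { url: "https://example.com" },
  tasks: [
    {
      name: "search",
      flow: [
        { kind: "ai", prompt: "search for headphones" },
        { kind: "aiAssert", prompt: "the result list is not empty" },
      ],
    },
  ],
};

const yamlText = toYaml(exampleScript);
```

The rendered text in yamlText could be saved to a .yaml file and passed to the command-line runner. Check the documentation for the exact schema before relying on this layout.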

JavaScript SDK for Playwright or Puppeteer

Midscene provides a JavaScript SDK that can be integrated into existing Playwright or Puppeteer scripts.

Model Selection and Costs

Midscene.js is not tied to any specific model provider; you can configure whichever AI service and model meet your security requirements.

Doubao-1.5-thinking-vision-pro – visual model on Volcano Engine, best for element positioning and UI understanding.

Qwen-2.5-VL – open‑source visual model from Alibaba Cloud, also available as a commercial deployment.

Other options: GPT-4o, open‑source UI‑TARS, etc.

Details on model selection are available in the documentation.
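
Since the service and model are supplied through configuration, a small preflight check can catch a missing key before any automation runs. The sketch below assumes the settings arrive as environment variables named OPENAI_API_KEY, OPENAI_BASE_URL, and MIDSCENE_MODEL_NAME. These names are an assumption here and should be confirmed against the documentation for your version. readModelConfig is a hypothetical helper.

```typescript
// Hypothetical view of the model settings read from the environment.
interface ModelConfig {
  apiKey: string;
  baseUrl?: string;
  modelName?: string;
}

type EnvVars = Record<string, string | undefined>;

// Read the (assumed) variable names and fail fast when the key is missing.
function readModelConfig(env: EnvVars): ModelConfig {
  const apiKey = env["OPENAI_API_KEY"];
  if (!apiKey) {
    throw new Error("OPENAI_API_KEY is not set");
  }
  return {
    apiKey,
    baseUrl: env["OPENAI_BASE_URL"], // optional: custom service endpoint
    modelName: env["MIDSCENE_MODEL_NAME"], // optional: model to use
  };
}
```

In a Node.js script the helper would typically be called as readModelConfig(process.env). Taking the environment as a parameter keeps it easy to test.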

Advanced Features

Cache – reuse execution results to reduce model calls after the first successful run.

Prompt engineering – techniques to help the model better understand developer intent.

JavaScript optimization – combine large‑scale AI commands with custom JavaScript for efficient workflows.

DOM visibility – flexible methods for extracting data from the page.
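
The caching idea in the list above can be illustrated with a prompt-keyed store that calls the model only on a miss and saves a result only after it succeeds. This is a generic sketch of the technique under that assumption, not Midscene's actual cache format or implementation. PromptCache and its members are hypothetical names.

```typescript
// Generic prompt-keyed cache: one model call per distinct prompt.
class PromptCache<T> {
  private readonly entries = new Map<string, T>();
  hits = 0;

  async getOrCompute(prompt: string, compute: () => Promise<T>): Promise<T> {
    if (this.entries.has(prompt)) {
      this.hits += 1;
      return this.entries.get(prompt) as T;
    }
    // If compute() rejects, nothing is stored, so a failed run is retried.
    const value = await compute();
    this.entries.set(prompt, value);
    return value;
  }
}

// Stand-in for an expensive model call; counts how often it is invoked.
let modelCalls = 0;
function fakePlan(): Promise<string> {
  modelCalls += 1;
  return Promise.resolve("click the search button");
}

const planCache = new PromptCache<string>();

// Two runs of the same prompt: the second one is served from the cache.
const secondRun: Promise<string> = planCache
  .getOrCompute("open search", fakePlan)
  .then(() => planCache.getOrCompute("open search", fakePlan));
```

A real cache would also need an invalidation rule, for example discarding an entry when the page no longer matches the recorded step.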

Project Information

GitHub repository: https://github.com/web-infra-dev/midscene

Homepage and documentation: https://midscenejs.com/zh


Written by

Software Development Quality
