How Midscene.js Leverages Multimodal AI for Zero‑Code UI Automation
Midscene.js, an open‑source UI automation framework from ByteDance’s Web Infra team, uses multimodal AI inference to drive UI automation. It can be used without code through a Chrome extension, through YAML scripts, or through a JavaScript SDK for Playwright and Puppeteer, and it targets both Web and Android. Its core interfaces cover actions, queries, and assertions.
Project Overview
Midscene.js is an open‑source UI automation tool released by ByteDance’s Web Infra team. It uses multimodal AI inference to let developers build UI automation projects quickly, and it supports Web and Android targets with integrations for Playwright, Puppeteer and others.
Zero‑Code Experience with Chrome Extension
Before writing any code, you can try the Chrome extension version of Midscene.js. The extension provides the core interfaces for interaction, data extraction and assertions, allowing you to run a scenario without writing a single line of code.
Install the Midscene extension and configure the AI service key.
Open the target shopping website.
Enter interaction commands in the extension, click “Run”, and view the execution result and playback animation.
Click the “Report File” button to obtain a complete replay file that records all steps and AI reasoning, which can be reused in later runs.
In the Query panel you can extract JSON data from the UI by describing the desired content and format, then clicking Run.
The Assert panel provides assertion capabilities.
Three Core Interfaces
.ai / .aiAction – describe steps and execute interactions.
.aiQuery – understand the UI and extract data as JSON.
.aiAssert – perform assertions.
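As a rough illustration of how the three interfaces fit together in one scenario, the TypeScript sketch below declares a minimal `UiAgent` interface and a `RecordingAgent` stub. Both names, the `Product` type and the prompts are hypothetical and exist only for this example. The stub records prompts instead of calling a model, and the method signatures are an assumption that should be checked against the Midscene documentation.

```typescript
// Hypothetical shape of an agent exposing the three core interfaces.
// The real SDK's signatures may differ; this is only a sketch.
interface UiAgent {
  aiAction(prompt: string): Promise<void>;
  aiQuery<T>(prompt: string): Promise<T>;
  aiAssert(assertion: string): Promise<void>;
}

// Illustrative stub: records each call and returns canned data
// instead of sending screenshots to a model.
class RecordingAgent implements UiAgent {
  readonly calls: string[] = [];
  private readonly cannedQueryResult: unknown;

  constructor(cannedQueryResult: unknown) {
    this.cannedQueryResult = cannedQueryResult;
  }

  async aiAction(prompt: string): Promise<void> {
    this.calls.push(`action: ${prompt}`);
  }

  async aiQuery<T>(prompt: string): Promise<T> {
    this.calls.push(`query: ${prompt}`);
    return this.cannedQueryResult as T;
  }

  async aiAssert(assertion: string): Promise<void> {
    this.calls.push(`assert: ${assertion}`);
  }
}

interface Product {
  title: string;
  price: number;
}

// One scenario: interact, extract structured data, then assert.
async function searchScenario(agent: UiAgent): Promise<Product[]> {
  await agent.aiAction('type "headphones" in the search box and press Enter');
  const products = await agent.aiQuery<Product[]>(
    "{title: string, price: number}[], the products in the result list"
  );
  await agent.aiAssert("the result list contains at least one product");
  return products;
}
```

In a real project, an agent created by the SDK for a Playwright or Puppeteer page would presumably take the place of the stub, while the scenario function stays the same.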
Integration Options
YAML Scripts
YAML scripts are easy to read and do not require a large test project, making them suitable for simple verification scenarios.
After setting environment variables, the script can be executed with a single command.
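To give a feel for what such a script contains, the sketch below models a script as a typed TypeScript object and renders it to YAML text. The key names (`web.url`, `tasks`, `flow`, `ai`, `aiQuery` with `name`, `aiAssert`, `sleep`) reflect my reading of the Midscene documentation and should be treated as assumptions. The `toYaml` helper, the example URL and the prompts are hypothetical, and the helper handles only this one shape.

```typescript
// Assumed step kinds in a script's flow; verify the key names in the docs.
type FlowStep =
  | { ai: string }
  | { aiQuery: string; name?: string }
  | { aiAssert: string }
  | { sleep: number };

interface Task {
  name: string;
  flow: FlowStep[];
}

interface Script {
  web: { url: string };
  tasks: Task[];
}

// Hypothetical helper: renders this specific shape as YAML text.
// Strings are emitted as double-quoted scalars via JSON.stringify,
// which YAML accepts.
function toYaml(script: Script): string {
  const lines: string[] = [];
  lines.push("web:");
  lines.push(`  url: ${JSON.stringify(script.web.url)}`);
  lines.push("tasks:");
  for (const task of script.tasks) {
    lines.push(`  - name: ${JSON.stringify(task.name)}`);
    lines.push("    flow:");
    for (const step of task.flow) {
      const record = step as Record<string, string | number | undefined>;
      const keys = Object.keys(record).filter((k) => record[k] !== undefined);
      keys.forEach((key, i) => {
        // The first key of a step opens the list item; later keys align under it.
        const prefix = i === 0 ? "      - " : "        ";
        lines.push(`${prefix}${key}: ${JSON.stringify(record[key])}`);
      });
    }
  }
  return lines.join("\n") + "\n";
}

const exampleScript: Script = {
  web: { url: "https://example.com/shop" },
  tasks: [
    {
      name: "search",
      flow: [
        { ai: "search for headphones" },
        { sleep: 3000 },
        { aiQuery: "string[], product titles in the result list", name: "titles" },
        { aiAssert: "at least one product is listed" },
      ],
    },
  ],
};

console.log(toYaml(exampleScript));
```

Writing the YAML file by hand is the normal route; the typed object is used here only to make the structure explicit.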
JavaScript SDK for Playwright or Puppeteer
Midscene provides a JavaScript SDK that can be integrated into existing Playwright or Puppeteer scripts.
Model Selection and Costs
Midscene.js is not tied to any specific large‑language‑model provider; you can configure whichever AI service and model meet your security requirements.
Doubao-1.5-thinking-vision-pro – visual model on Volcano Engine, best for element positioning and UI understanding.
Qwen-2.5-VL – open‑source visual model from Alibaba Cloud, also available as a commercial deployment.
Other options: GPT-4o, open‑source UI‑TARS, etc.
Details on model selection are available in the documentation.
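Because the provider is configurable, switching models is mostly a matter of changing environment variables. The sketch below assembles such a set for a chosen model family. The variable names (`OPENAI_API_KEY`, `OPENAI_BASE_URL`, `MIDSCENE_MODEL_NAME`, `MIDSCENE_USE_QWEN_VL`, `MIDSCENE_USE_DOUBAO_VISION`) are given as I recall them from the Midscene documentation and should be verified there. The `ModelChoice` type and the `buildModelEnv` helper are hypothetical.

```typescript
type ModelFamily = "gpt-4o" | "qwen-vl" | "doubao-vision";

interface ModelChoice {
  family: ModelFamily;
  apiKey: string;
  modelName: string;
  baseUrl?: string; // endpoint of an OpenAI-compatible service, if not the default
}

// Assumed per-family switches; null means no extra flag is needed.
const FAMILY_FLAGS: Record<ModelFamily, string | null> = {
  "gpt-4o": null,
  "qwen-vl": "MIDSCENE_USE_QWEN_VL",
  "doubao-vision": "MIDSCENE_USE_DOUBAO_VISION",
};

// Hypothetical helper: builds the environment variables for one model choice.
function buildModelEnv(choice: ModelChoice): Record<string, string> {
  const env: Record<string, string> = {
    OPENAI_API_KEY: choice.apiKey,
    MIDSCENE_MODEL_NAME: choice.modelName,
  };
  if (choice.baseUrl !== undefined) {
    env["OPENAI_BASE_URL"] = choice.baseUrl;
  }
  const flag = FAMILY_FLAGS[choice.family];
  if (flag !== null) {
    env[flag] = "1";
  }
  return env;
}
```

Keeping the key out of source code and passing it in at run time, for example from a secrets store, fits the security concern mentioned above.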
Advanced Features
Cache – reuse execution results to reduce model calls after the first successful run.
Prompt engineering – techniques to help the model better understand developer intent.
JavaScript optimization – combine large‑scale AI commands with custom JavaScript for efficient workflows.
DOM visibility – flexible methods for extracting data from the page.
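The caching idea in the list above can be pictured as memoizing the model's answer per prompt. The sketch below is a generic illustration of that principle, not Midscene's actual cache format or implementation. `Planner` and `CachedPlanner` are hypothetical names, and the planner passed in stands in for a model call.

```typescript
// A planner turns a natural-language prompt into a list of concrete steps.
// In practice this would be the expensive model call.
type Planner = (prompt: string) => string[];

class CachedPlanner {
  private readonly cache = new Map<string, string[]>();
  private readonly planWithModel: Planner;
  modelCalls = 0;

  constructor(planWithModel: Planner) {
    this.planWithModel = planWithModel;
  }

  plan(prompt: string): string[] {
    const cached = this.cache.get(prompt);
    if (cached !== undefined) {
      return cached; // cache hit: no model call
    }
    this.modelCalls += 1;
    // If the planner throws, nothing is stored, so only successful
    // results are reused on later runs.
    const steps = this.planWithModel(prompt);
    this.cache.set(prompt, steps);
    return steps;
  }
}
```

A real cache also has to be invalidated when the page changes enough that the stored steps no longer apply, which this sketch leaves out.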
Project Information
GitHub repository: https://github.com/web-infra-dev/midscene
Homepage and documentation: https://midscenejs.com/zh
