How Midscene.js Leverages Multimodal AI for Zero‑Code UI Automation

Midscene.js, an open-source UI automation framework from ByteDance’s Web Infra team, uses multimodal AI inference to drive user interfaces from natural-language instructions. It can be used with no code through a Chrome extension, through YAML scripts, or through a JavaScript SDK integrated with Playwright or Puppeteer, and it targets both Web and Android. Its core interfaces cover actions, queries, and assertions.

Software Development Quality

Project Overview

Midscene.js is an open-source UI automation tool released by ByteDance’s Web Infra team. It uses multimodal AI inference to let developers build UI automation projects quickly, and it supports Web and Android targets as well as integration with Playwright and Puppeteer, among other integration forms.

Midscene.js overview

Zero‑Code Experience with Chrome Extension

Before writing any code, you can try the Chrome extension version of Midscene.js. The extension provides the core interfaces for interaction, data extraction and assertions, allowing you to run a scenario without writing a single line of code.

1. Install the Midscene extension and configure the AI service key.

2. Open the target shopping website.

3. Enter interaction commands in the extension, click “Run”, and view the execution result and playback animation.

4. Click the “Report File” button to obtain a complete replay file that records every step and the AI reasoning behind it; the file can be reused in later runs.

Report file panel

In the Query panel you can extract JSON data from the UI by describing the desired content and format, then clicking Run.

Query panel

The Assert panel provides assertion capabilities.

Assert panel

Three Core Interfaces

.ai / .aiAction – describe steps and execute interactions.

.aiQuery – understand the UI and extract data as JSON.

.aiAssert – perform assertions.
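
As a rough illustration of how the three interfaces fit together, the sketch below models them as a TypeScript interface and runs a short flow against an in-memory stub. Only the method names aiAction, aiQuery and aiAssert come from Midscene.js. The signatures, the UiAgent interface, the StubAgent class, the searchAndCheck helper and the product data are assumptions made for this example, and no model or browser is involved.

```typescript
// Hypothetical shape of the three core interfaces; the signatures are assumptions.
interface UiAgent {
  aiAction(instruction: string): Promise<void>;
  aiQuery<T>(dataDemand: string): Promise<T>;
  aiAssert(assertion: string): Promise<void>;
}

interface Product {
  title: string;
  price: number;
}

// In-memory stand-in so the flow can run without a model or a browser.
class StubAgent implements UiAgent {
  log: string[] = [];
  products: Product[];

  constructor(products: Product[]) {
    this.products = products;
  }

  // Records the instruction instead of driving a real page.
  async aiAction(instruction: string): Promise<void> {
    this.log.push(`action: ${instruction}`);
  }

  // Returns the canned products regardless of the description.
  async aiQuery<T>(dataDemand: string): Promise<T> {
    this.log.push(`query: ${dataDemand}`);
    return this.products as unknown as T;
  }

  // Fails when the stub holds no products, mimicking a failed assertion.
  async aiAssert(assertion: string): Promise<void> {
    this.log.push(`assert: ${assertion}`);
    if (this.products.length === 0) {
      throw new Error(`Assertion failed: ${assertion}`);
    }
  }
}

// Interact, extract, then assert: the order in which the interfaces are typically combined.
async function searchAndCheck(agent: UiAgent): Promise<Product[]> {
  await agent.aiAction('type "headphones" in the search box and press Enter');
  const items = await agent.aiQuery<Product[]>(
    "{title: string, price: number}[], the products in the result list",
  );
  await agent.aiAssert("the result list shows at least one product");
  return items;
}
```

Because searchAndCheck depends only on the interface, the same flow could in principle be handed a real agent in place of the stub, provided that agent's methods match the signatures assumed here.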

Integration Options

YAML Scripts

YAML scripts are easy to read and do not require a large test project, making them suitable for simple verification scenarios.

YAML script example

After setting environment variables, the script can be executed with a single command.

YAML execution command

JavaScript SDK for Playwright or Puppeteer

Midscene provides a JavaScript SDK that can be integrated into existing Playwright or Puppeteer scripts.

JS SDK integration example

Model Selection and Costs

Midscene.js is not tied to any specific large language model provider; you can configure whichever AI service and model meet your security requirements. Options include the following.

Doubao-1.5-thinking-vision-pro – visual model on Volcano Engine, best for element positioning and UI understanding.

Qwen-2.5-VL – open‑source visual model from Alibaba Cloud, also available as a commercial deployment.

Other options: GPT-4o, open‑source UI‑TARS, etc.

Details on model selection are available in the documentation.

Advanced Features

Cache – reuse execution results to reduce model calls after the first successful run.

Prompt engineering – techniques to help the model better understand developer intent.

JavaScript optimization – combine large‑scale AI commands with custom JavaScript for efficient workflows.

DOM visibility – flexible methods for extracting data from the page.
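
The caching idea can be sketched in a few lines: keep the steps planned for an instruction after the first successful run, and reuse them while the instruction is unchanged. This is not Midscene's implementation. PlanCache, PlannedStep, planWithModel and the key scheme are hypothetical, and a real cache would also need to be invalidated when the page changes.

```typescript
// Hypothetical planned step: what to do and on which element.
interface PlannedStep {
  action: "click" | "type";
  target: string;
}

type Planner = (instruction: string) => Promise<PlannedStep[]>;

class PlanCache {
  private store = new Map<string, PlannedStep[]>();
  private planner: Planner;
  modelCalls = 0;

  constructor(planner: Planner) {
    this.planner = planner;
  }

  // Return the cached plan if present; otherwise call the planner and store its result.
  // A planner that throws leaves the cache untouched, so only successful runs are reused.
  async plan(instruction: string): Promise<PlannedStep[]> {
    const key = instruction.trim();
    const cached = this.store.get(key);
    if (cached !== undefined) {
      return cached;
    }
    this.modelCalls += 1;
    const steps = await this.planner(key);
    this.store.set(key, steps);
    return steps;
  }
}

// Stand-in for a model call; a real planner would send a screenshot and the instruction to a model.
async function planWithModel(instruction: string): Promise<PlannedStep[]> {
  return [{ action: "click", target: instruction }];
}
```

Calling plan twice with the same instruction invokes the planner once, which is the saving the cache feature is meant to provide.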

Project Information

GitHub repository: https://github.com/web-infra-dev/midscene

Homepage and documentation: https://midscenejs.com/zh

Written by

Software Development Quality
