
Midscene.js: An AI‑Driven UI Automation Framework from ByteDance

Midscene.js is an open‑source UI automation framework that leverages multimodal AI to simplify web UI testing and interaction, offering three core interfaces—Action, Query, and Assert—along with a JavaScript SDK, support for multiple AI models, YAML scripting, and future‑focused features for stable, scalable automation.

ByteDance Web Infra

Mar 21, 2025


Midscene.js, developed by ByteDance's Web Infra team, is a recently open‑sourced UI automation tool that uses multimodal AI to ease the scripting and maintenance burden of traditional UI automation.

Origin and AI × UI Automation Demo: The talk opens with a demo in which AI analyzes a UI step by step, selecting fields, copying data, and completing the task autonomously.

Challenges of Traditional Web UI Automation: Conventional automation relies on selectors (ID, class, XPath) and tightly couples test code with business logic, leading to fragility after refactoring.
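To make this fragility concrete, the toy sketch below models a page as a plain list of elements and compares a lookup by id with a lookup by visible label. The UiElement type and both helper functions are hypothetical and are not part of Midscene.js; the label lookup is only a stand-in for a locator that targets what the user sees rather than how the page is implemented.

```typescript
// Toy model of a page element; hypothetical, for illustration only.
interface UiElement {
  id: string;
  label: string; // text the user actually sees
}

// Selector-style lookup: depends on an implementation detail (the id).
function findById(page: UiElement[], id: string): UiElement | undefined {
  return page.find((el) => el.id === id);
}

// Intent-style lookup: depends only on what is visible to the user.
function findByLabel(page: UiElement[], label: string): UiElement | undefined {
  const wanted = label.toLowerCase();
  return page.find((el) => el.label.toLowerCase() === wanted);
}

const beforeRefactor: UiElement[] = [{ id: "btn-submit", label: "Submit" }];
// A refactor renames the id; the visible label is unchanged.
const afterRefactor: UiElement[] = [{ id: "form-cta-primary", label: "Submit" }];

console.log(findById(beforeRefactor, "btn-submit") !== undefined); // true
console.log(findById(afterRefactor, "btn-submit") !== undefined); // false
console.log(findByLabel(afterRefactor, "submit") !== undefined); // true
```

The id-based test fails after the rename even though nothing changed for the user, which is the kind of breakage the talk attributes to selector-driven scripts.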

AI‑Driven Solution: By using natural‑language instructions, AI can plan actions, locate elements, and interact with them, decoupling tests from code and improving maintainability.

Core Interfaces: Action, Query, and Assert

Action: Executes interactions such as typing a keyword, clicking a result, or posting a tweet based on step‑by‑step prompts.

Query: Extracts information from the UI and returns structured JSON data, enabling flexible data retrieval.

Assert: Validates UI state against expectations, e.g., confirming a page title or element presence.
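The three interfaces can be sketched as one small flow. The method names aiAction, aiQuery, and aiAssert follow the naming used by the project's SDK as far as I am aware, but the UiAgent interface, the recording stub, and the prompts below are assumptions made for illustration; the stub calls no model and only records prompts, so check the official documentation for real signatures.

```typescript
// Hypothetical minimal shape of an agent exposing the three interfaces.
interface UiAgent {
  aiAction(prompt: string): Promise<void>;
  aiQuery<T>(prompt: string): Promise<T>;
  aiAssert(prompt: string): Promise<void>;
}

// Stand-in agent: records each prompt and returns canned query data.
// A real agent would send the prompt and a screenshot to a model.
function createRecordingAgent(cannedQueryResult: unknown, log: string[]): UiAgent {
  return {
    async aiAction(prompt: string): Promise<void> {
      log.push(`action: ${prompt}`);
    },
    async aiQuery<T>(prompt: string): Promise<T> {
      log.push(`query: ${prompt}`);
      return cannedQueryResult as T;
    },
    async aiAssert(prompt: string): Promise<void> {
      log.push(`assert: ${prompt}`);
    },
  };
}

interface Headline {
  title: string;
  url: string;
}

// Action, then Query, then Assert, all driven by natural-language prompts.
async function searchAndCollect(agent: UiAgent): Promise<Headline[]> {
  await agent.aiAction('type "Midscene.js" in the search box and press Enter');
  const results = await agent.aiQuery<Headline[]>(
    "{title: string, url: string}[], the search results on the page",
  );
  await agent.aiAssert("the page shows a list of search results");
  return results;
}

const log: string[] = [];
const agent = createRecordingAgent(
  [{ title: "Midscene.js", url: "https://midscenejs.com/" }],
  log,
);
searchAndCollect(agent).then((results) => {
  console.log(results, log);
});
```

Typing the query result (Headline[] here) is what makes the returned JSON usable in ordinary test code once the model has extracted it.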

JavaScript SDK: Midscene.js can be combined with Puppeteer or Playwright; all interface inputs are natural language, and the SDK handles the underlying AI calls.

Supported AI Models:

GPT‑4o (OpenAI, closed‑source, high cost, token‑heavy).

Qwen‑2.5‑VL (Alibaba, open‑source, cost‑effective, strong image understanding).

UI‑TARS (ByteDance’s own UI‑focused model, high accuracy, requires self‑hosted GPU resources).
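The trade-offs listed above can be turned into a simple shortlist function. This is only a sketch: the ModelOption fields and the candidateModels helper are hypothetical, the flags paraphrase the article's characterizations rather than any benchmark, and treating UI‑TARS as free of high token cost and Qwen‑2.5‑VL as usable without self-hosted GPUs are my assumptions.

```typescript
// Flags paraphrase the article's descriptions; they are not measurements.
interface ModelOption {
  name: string;
  highTokenCost: boolean; // article: GPT-4o is high cost and token-heavy
  needsSelfHostedGpu: boolean; // article: UI-TARS needs self-hosted GPUs
}

const MODELS: ModelOption[] = [
  { name: "GPT-4o", highTokenCost: true, needsSelfHostedGpu: false },
  // Assumption: Qwen-2.5-VL is consumed through a hosted endpoint here.
  { name: "Qwen-2.5-VL", highTokenCost: false, needsSelfHostedGpu: false },
  // Assumption: self-hosting shifts the cost to GPUs rather than tokens.
  { name: "UI-TARS", highTokenCost: false, needsSelfHostedGpu: true },
];

// Returns the names of models compatible with the given constraints.
function candidateModels(opts: { hasGpu: boolean; budgetSensitive: boolean }): string[] {
  return MODELS
    .filter((m) => opts.hasGpu || !m.needsSelfHostedGpu)
    .filter((m) => !opts.budgetSensitive || !m.highTokenCost)
    .map((m) => m.name);
}

// Without GPUs and on a tight budget, only Qwen-2.5-VL remains.
console.log(candidateModels({ hasGpu: false, budgetSensitive: true }));
```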

Additional Features: YAML‑based automation scripts for CI‑style workflows, aiWaitFor for waiting on UI conditions, bridge mode for desktop Chrome interactions, and LangSmith integration for debugging.
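Waiting on a UI condition is commonly implemented as polling with a timeout, which the sketch below illustrates. It is not Midscene's implementation of aiWaitFor: the waitForCondition helper and its timeoutMs and checkIntervalMs options are hypothetical names, and the check callback stands in for asking a model whether a natural-language condition currently holds.

```typescript
interface WaitOptions {
  timeoutMs: number; // give up after this long
  checkIntervalMs: number; // pause between checks
}

// Polls `check` until it resolves to true or the timeout would be exceeded.
async function waitForCondition(
  check: () => Promise<boolean>,
  opts: WaitOptions,
): Promise<void> {
  const deadline = Date.now() + opts.timeoutMs;
  while (true) {
    if (await check()) return;
    // Stop if the next check could not start before the deadline.
    if (Date.now() + opts.checkIntervalMs > deadline) {
      throw new Error(`condition not met within ${opts.timeoutMs} ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, opts.checkIntervalMs));
  }
}

// Usage: the condition becomes true on the third check.
let polls = 0;
waitForCondition(async () => ++polls >= 3, { timeoutMs: 1000, checkIntervalMs: 10 })
  .then(() => console.log(`condition met after ${polls} checks`));
```

Because every check in an AI-driven setup implies a model call, the interval is a cost knob as well as a latency knob.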

Principles and Model Architecture: The system performs OCR, element localization, interaction understanding, and step planning. It currently relies on external models rather than a proprietary one.
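The plan, locate, and interact stages can be sketched as a small loop over injected capabilities. Everything here is hypothetical (the Step and Capabilities types, runInstruction, and the stub values); OCR is omitted, and the plan is produced once up front, whereas a real system may re-plan from a fresh screenshot after each step.

```typescript
type Step =
  | { kind: "tap"; target: string }
  | { kind: "input"; target: string; text: string };

interface Point {
  x: number;
  y: number;
}

// Injected capabilities; in a real system each would call a multimodal
// model or a browser driver. All names are hypothetical.
interface Capabilities {
  plan(instruction: string): Step[];
  locate(targetDescription: string): Point | undefined;
  perform(step: Step, at: Point): void;
}

// Plans the instruction, then locates and performs each step in order.
// Returns the number of steps performed.
function runInstruction(instruction: string, caps: Capabilities): number {
  const steps = caps.plan(instruction);
  let done = 0;
  for (const step of steps) {
    const point = caps.locate(step.target);
    if (point === undefined) {
      throw new Error(`could not locate "${step.target}"`);
    }
    caps.perform(step, point);
    done += 1;
  }
  return done;
}

// Usage with canned stubs in place of model and browser calls.
const performed: string[] = [];
const stub: Capabilities = {
  plan: () => [
    { kind: "input", target: "search box", text: "headphones" },
    { kind: "tap", target: "search button" },
  ],
  locate: (t) => (t === "search box" ? { x: 120, y: 40 } : { x: 300, y: 40 }),
  perform: (step, at) => {
    performed.push(`${step.kind}@${at.x},${at.y}`);
  },
};
console.log(runInstruction("search for headphones", stub), performed);
```

Keeping planning and localization behind separate functions mirrors the article's point that the system relies on external models: either stage can be backed by a different model without changing the loop.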

Practical Outlook: The talk emphasizes the trade‑off between model creativity and stability, describes an “impossible triangle” of cost, speed, and quality that cannot all be optimized at once, and predicts that AI‑driven UI automation will become a foundational capability by 2025.

For more information, visit the project site at https://midscenejs.com/ and the GitHub repository at https://github.com/web-infra-dev/midscene
