
Midscene.js: Multimodal AI‑Powered UI Automation for Web Frontend Testing

Midscene.js, an open-source UI automation framework from ByteDance Web Infra, uses multimodal AI to simplify writing, maintaining, and debugging web UI tests through JavaScript or YAML. This article covers its origins, usage patterns, limitations, cost, and security considerations.

ByteDance Web Infra

UI automation for web interfaces has long been a thorny problem. Despite mature tools like Playwright and Cypress, few teams manage to keep their automation scripts stable, because selectors are complex, scripts are tightly coupled to the HTML, and assertions are hard to write.

Midscene.js is a newly open‑sourced UI automation tool from ByteDance Web Infra that introduces multimodal AI inference to help developers overcome the traditional difficulties of writing and maintaining UI automation.

Origin: Possibilities Brought by Multimodal AI

As multimodal AI continues to evolve, these models are acquiring content extraction and understanding capabilities that closely match the needs of automation testing.

Interface understanding and information extraction are demonstrated in the paper "The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)": https://arxiv.org/abs/2309.17421

After a series of evaluations, we found that using general-purpose large models to understand a UI and execute automation tests is feasible. Large models may hallucinate and struggle with precise coordinate values, but these issues can be mitigated with engineering techniques that make the runtime more reliable.

Experience Writing UI Automation with Midscene.js

Using Midscene.js relies on three kinds of methods: interaction (.ai, .aiAction), extraction (.aiQuery), and assertion (.aiAssert).

Specifically:

Use .ai to describe steps and perform interactions.

Use .aiQuery to understand and extract data from the UI; the return value is JSON and can describe any desired data structure.

Use .aiAssert to perform assertions.

Below is a simple test case for a Todo App:

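import { expect } from "@playwright/test";
// Assumed local fixture file that extends Playwright's `test` with the
// Midscene fixtures (ai, aiQuery, aiAssert); adjust the path to your project.
import { test } from "./fixture";

test("ai todo", async ({
  ai,
  aiQuery,
  aiAssert,
}) => {
  // Add three tasks, one per prompt
  await ai("Enter 'Learn JS today' in the task box input, then press Enter");
  await ai("Enter 'Learn Rust tomorrow' in the task box input, then press Enter");
  await ai("Enter 'Learn AI the day after tomorrow' in the task box input, then press Enter");
  await ai("Move the mouse to the second item in the task list and click the delete button to the right of the second task");
  await ai("Click the check button to the left of the second task");
  await ai("Click the 'completed' status button below the task list");

  // Extract data in any format
  const list = await aiQuery("string[], the complete task list");
  expect(list.length).toEqual(1);

  await aiAssert('The bottom of the page shows "1 item left"');
});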

Running the above case provides a full operation replay to help developers diagnose the execution process.

If you want to adjust the prompts, you can repeatedly rerun them in the Playground environment.

JavaScript or YAML: Multiple Integration Forms

Midscene.js supports JavaScript integration with Puppeteer and Playwright, and also supports describing workflows in YAML, enabling near zero‑code testing for simple site checks and build‑artifact inspections.
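As a rough illustration of the JavaScript form, the sketch below drives a short flow through an agent object and then checks the extracted data with ordinary code. It assumes only that `agent` exposes the aiAction, aiQuery, and aiAssert methods named earlier; the helper name `collectItems` and the prompts are hypothetical, and in a real project the agent would come from Midscene's Puppeteer or Playwright integration.

```javascript
// Hypothetical helper, not part of Midscene.js. `agent` is any object that
// exposes async aiAction / aiQuery / aiAssert methods.
async function collectItems(agent, keyword) {
  // Interaction: describe the step in natural language.
  await agent.aiAction(`type '${keyword}' in search box, hit Enter`);

  // Extraction: the prompt starts with the desired data shape.
  const items = await agent.aiQuery(
    "{name: string, price: number}[], return item name and price of each item"
  );

  // Model output is not guaranteed to match the requested shape, so verify
  // it with plain code before relying on it.
  if (!Array.isArray(items)) {
    throw new TypeError("aiQuery did not return an array");
  }
  const valid = items.filter(
    (item) =>
      item !== null &&
      typeof item === "object" &&
      typeof item.name === "string" &&
      typeof item.price === "number" &&
      Number.isFinite(item.price)
  );

  // Assertion: stated in natural language as well.
  await agent.aiAssert("There is a shopping cart icon on the top right");
  return valid;
}
```

Filtering the result in plain code keeps a malformed model answer from silently passing a test, and it lets the helper be exercised with a stub agent before any model is called.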

Here is a practical YAML script example:

# Search for headphones on eBay, extract the results as JSON, and run an assertion

target:
  url: https://www.ebay.com
  output: ./output/ebay-headphones.json

tasks:
  - name: search headphones
    flow:
      - aiAction: type 'Headphones' in search box, hit Enter
      - aiWaitFor: there is at least one headphone item on page
        timeout: 10000

  - name: extract headphones info
    flow:
      - aiQuery: >
          {name: string, price: number, actionBtnName: string}[], return item name, price and the action button name on the lower right corner of each item (like 'Remove')
        name: headphones

  - name: assert shopping cart icon
    flow:
      - aiAssert: There is a shopping cart icon on the top right

After setting the environment variables, a YAML script file runs with a single command. For a script saved as bing-search.yaml:

midscene ./bing-search.yaml

Start with the Chrome Extension

To let developers quickly evaluate Midscene.js on real sites, a Chrome extension version is published. It can be used on any site without writing code.

Installation and configuration instructions: https://midscenejs.com/zh/quick-experience.html

Limitations and Shortcomings

Calling large‑model services incurs inference latency, making real‑time debugging less ideal.

Interaction types are limited to click, input, keyboard, and scroll operations.

Precise prompts are required; even GPT-4o cannot guarantee 100% correct answers. Developers need prompt-engineering techniques such as providing detailed descriptions and examples.

Data Security and Cost

Midscene.js defaults to GPT-4o for inference but is not tied to any specific LLM provider. You can configure your own AI service and model, so that page content is sent only to the service you choose.
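A small pre-flight check can make it obvious which service a test run will talk to. The sketch below is only an illustration: the variable names OPENAI_API_KEY, OPENAI_BASE_URL, and MIDSCENE_MODEL_NAME are assumptions based on the Midscene.js documentation at the time of writing and should be checked against the current docs, and `describeModelConfig` is a hypothetical helper.

```javascript
// Assumed environment variable names; verify them against the Midscene.js docs.
const REQUIRED_VARS = ["OPENAI_API_KEY"];

// Hypothetical helper: summarizes the model configuration without ever
// returning or printing the API key itself.
function describeModelConfig(env) {
  const missing = REQUIRED_VARS.filter((name) => !env[name]);
  return {
    missing,
    usesCustomEndpoint: Boolean(env.OPENAI_BASE_URL),
    endpoint: env.OPENAI_BASE_URL || "(provider default)",
    model: env.MIDSCENE_MODEL_NAME || "(Midscene.js default)",
  };
}

const config = describeModelConfig(process.env);
if (config.missing.length > 0) {
  console.warn(`Missing environment variables: ${config.missing.join(", ")}`);
}
console.log(`Endpoint: ${config.endpoint}, model: ${config.model}`);
```

Running a check like this before the test suite helps confirm that page content will go to the intended endpoint rather than to a default one.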

For the sample project, the runtime cost on the gpt‑4o‑08‑06 model (without prompt caching) is as follows:

| Task | Resolution | Prompt Tokens / Price | Completion Tokens / Price | Total Price |
| --- | --- | --- | --- | --- |
| Plan and execute a search on eBay | 1280x800 | 6005 / $0.0150125 | 146 / $0.00146 | $0.02 |
| Query eBay search results | 1280x800 | 9107 / $0.0227675 | 122 / $0.00122 | $0.02 |

Measurement time: November 2024
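The prices above follow directly from the token counts. The sketch below reproduces the arithmetic under the assumption of $2.50 per million prompt tokens and $10 per million completion tokens; these rates are inferred from the figures above (6005 tokens for $0.0150125, 146 tokens for $0.00146) and are not a current price list. `estimateCost` is a hypothetical helper.

```javascript
// Assumed rates in USD per one million tokens, inferred from the figures above.
const PROMPT_PRICE_PER_MILLION = 2.5;
const COMPLETION_PRICE_PER_MILLION = 10;

// Hypothetical helper: cost of one model call given its token counts.
function estimateCost(promptTokens, completionTokens) {
  const promptCost = (promptTokens * PROMPT_PRICE_PER_MILLION) / 1e6;
  const completionCost =
    (completionTokens * COMPLETION_PRICE_PER_MILLION) / 1e6;
  return { promptCost, completionCost, total: promptCost + completionCost };
}

const searchTask = estimateCost(6005, 146); // plan and execute a search
const queryTask = estimateCost(9107, 122); // query the search results

console.log(searchTask.total.toFixed(2)); // "0.02"
console.log(queryTask.total.toFixed(2)); // "0.02"
```

Prompt tokens dominate the bill because each call carries the page context, so multiplying the per-call estimate by the number of AI steps in a suite gives a first-order budget.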

Observations on UI Automation

Many teams and developers are interested in UI automation; the demand exceeds expectations. Previously, concerns about maintainability kept many from participating, but large models are changing that.

Lowering test‑authoring cost will encourage more developers to write test cases, expanding the responsibility beyond QA.

LLM services incur some cost, which many teams have not budgeted for, but the investment is justified compared to the cost of bugs in production.

Open-source models are reaching impressive levels; future iterations of Midscene.js will continue to add support for newer models, further improving the UI automation experience.

Midscene.js Related Information

GitHub: https://github.com/web-infra-dev/midscene

Homepage and documentation: https://midscenejs.com/
