Midscene.js: An AI‑Powered UI Automation Framework for Web Testing

Midscene.js leverages multimodal AI to simplify web UI automation by providing .ai, .aiQuery and .aiAssert methods, supporting JavaScript and YAML integrations, a Chrome extension, and detailed cost analysis while acknowledging latency, interaction limits, and prompt‑engineering challenges.

Full-Stack Cultivation Path

Dec 18, 2024

Why UI automation is hard

Web UI automation suffers from fragile selectors, tight coupling to HTML structure, and difficult assertions, which makes stable, maintainable scripts rare despite tools like Playwright and Cypress.

Origin: multimodal AI makes it possible

Recent multimodal large models can extract and understand visual content, a capability that directly addresses the challenges of UI automation described above.

How Midscene.js works

Midscene.js, open‑sourced by ByteDance Web Infra, introduces four key AI‑driven methods:

.ai – describe a step in natural language and let the model perform the interaction.

.aiQuery – extract data from the UI and return it as JSON.

.aiAssert – perform assertions on the page.

.aiAction – the long form of .ai; the two are interchangeable for interaction commands.

Sample test for a Todo app

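import { expect } from "@playwright/test";
// Assumption: a local fixture file that extends Playwright's `test`
// with Midscene's ai / aiQuery / aiAssert fixtures.
import { test } from "./fixture";

// Prompts translated from the original Chinese; the test title is kept as published.
test("ai todo - Chinese Prompt - should fail", async ({
  ai,
  aiQuery,
  aiAssert,
}) => {
  await ai("Type 'Learn JS today' in the task input box, then press Enter");
  await ai("Type 'Learn Rust tomorrow' in the task input box, then press Enter");
  await ai("Type 'Learn AI the day after tomorrow' in the task input box, then press Enter");
  await ai("Hover over the second item in the task list and click the delete button to the right of it");
  await ai("Click the check button to the left of the second task");
  await ai("Click the 'completed' status button below the task list");

  // Extract data in any format
  const list = await aiQuery("string[], the complete task list");
  expect(list.length).toEqual(1);

  await aiAssert('The bottom of the page shows "1 item left"');
});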

Running this test produces a full replay video to help diagnose failures, and prompts can be iterated quickly in the Playground.

Integration styles

Midscene.js can be used directly in JavaScript with Puppeteer or Playwright, or described declaratively in YAML, which can reduce test‑setup effort to near‑zero for simple site checks.

# Search for headphones on eBay, extract the results as JSON, and assert

target:
  url: https://www.ebay.com
  output: ./output/ebay-headphones.json

tasks:
  - name: search headphones
    flow:
      - aiAction: type 'Headphones' in search box, hit Enter
      - aiWaitFor: there is at least one headphone item on page
        timeout: 10000

  - name: extract headphones info
    flow:
      - aiQuery: >
          {name: string, price: number, actionBtnName: string}[], return item name, price and the action button name on the lower right corner of each item (like 'Remove')
        name: headphones

  - name: assert shopping cart icon
    flow:
      - aiAssert: There is a shopping cart icon on the top right

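After setting the required environment variables (for the default GPT‑4o setup, the model API key), the script runs with a single command that takes the path to the saved YAML file: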

midscene ./bing-search.yaml
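
For comparison, the YAML flow above can be sketched in JavaScript. This is a minimal sketch, not Midscene's documented API: it assumes an agent object that exposes the four methods named in the YAML (aiAction, aiWaitFor, aiQuery, aiAssert), and the timeoutMs option name is an assumption. The helper createStubAgent is hypothetical and exists only so the flow can be exercised without a browser or a model.

```javascript
// Sketch only: `agent` is any object exposing the four methods named in the
// YAML above. Method signatures and the `timeoutMs` option are assumptions.
async function runEbayFlow(agent) {
  // Task 1: search headphones
  await agent.aiAction("type 'Headphones' in search box, hit Enter");
  await agent.aiWaitFor("there is at least one headphone item on page", {
    timeoutMs: 10000,
  });

  // Task 2: extract headphones info (the prompt starts with the expected shape)
  const headphones = await agent.aiQuery(
    "{name: string, price: number, actionBtnName: string}[], return item name, " +
      "price and the action button name on the lower right corner of each item"
  );

  // Task 3: assert shopping cart icon
  await agent.aiAssert("There is a shopping cart icon on the top right");

  return { headphones };
}

// Hypothetical stub agent: records every call and returns a canned query result.
function createStubAgent(queryResult) {
  const calls = [];
  const record = (method) => async (prompt) => {
    calls.push({ method, prompt });
  };
  return {
    calls,
    aiAction: record("aiAction"),
    aiWaitFor: record("aiWaitFor"),
    aiAssert: record("aiAssert"),
    aiQuery: async (prompt) => {
      calls.push({ method: "aiQuery", prompt });
      return queryResult;
    },
  };
}
```

Passing the agent in as a parameter keeps the flow independent of the browser driver, so the same function could be handed a real Puppeteer- or Playwright-backed agent or the stub.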

Chrome extension

A Chrome extension lets developers try Midscene.js on any site without writing code.

Limitations

LLM inference latency makes real‑time debugging less smooth.

Supported interactions are limited to click, input, keyboard and scroll.

Precise prompts are required; even GPT‑4o can return incorrect answers, so prompt‑engineering tips (detailed description + examples) are recommended.
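
Because the model can return a wrong or malformed answer, one defensive option alongside better prompts is to check the shape of an .aiQuery result before asserting on it. The guard below is a hypothetical helper, not part of Midscene.js; it checks the shape requested in the eBay example, {name: string, price: number, actionBtnName: string}[].

```javascript
// Hypothetical guard (not a Midscene.js API): returns true only if `value`
// is an array whose items all match {name: string, price: number, actionBtnName: string}.
function isHeadphoneList(value) {
  return (
    Array.isArray(value) &&
    value.every(
      (item) =>
        item !== null &&
        typeof item === "object" &&
        typeof item.name === "string" &&
        typeof item.price === "number" &&
        Number.isFinite(item.price) &&
        typeof item.actionBtnName === "string"
    )
  );
}

// Usage: fail fast with a clear message instead of a confusing downstream error.
function assertHeadphoneList(value) {
  if (!isHeadphoneList(value)) {
    throw new TypeError("aiQuery result does not match the requested shape");
  }
  return value;
}
```

A typical failure this catches is a price returned as a string such as "$19.99" when the prompt asked for a number.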

Data security and cost

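Midscene.js defaults to GPT‑4o but can be configured to use any compatible AI provider. Page content is sent to whichever provider is configured, so teams that need it to stay inside their own environment can point Midscene.js at a model they host or trust.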

Sample cost for the eBay demo (GPT‑4o‑08‑06, no caching):

Plan & search on eBay (1280×800): 6005 prompt tokens ($0.0150125) + 146 completion tokens ($0.00146) ≈ $0.02

Query product info (1280×800): 9107 prompt tokens ($0.0227675) + 122 completion tokens ($0.00122) ≈ $0.02

Measurement taken in November 2024.
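
The arithmetic behind these figures can be reproduced with a short sketch. The per-token rates are not stated in the article; they are derived from its numbers ($0.0150125 / 6005 prompt tokens and $0.00146 / 146 completion tokens), which works out to $2.50 per million prompt tokens and $10 per million completion tokens. The function name is illustrative.

```javascript
// Rates implied by the article's figures (an inference, not a quoted price list).
const USD_PER_PROMPT_TOKEN = 2.5 / 1e6;
const USD_PER_COMPLETION_TOKEN = 10 / 1e6;

// Estimate the cost of one model call from its token counts.
function estimateCostUsd(promptTokens, completionTokens) {
  return (
    promptTokens * USD_PER_PROMPT_TOKEN +
    completionTokens * USD_PER_COMPLETION_TOKEN
  );
}

const planAndSearch = estimateCostUsd(6005, 146); // plan & search on eBay
const queryInfo = estimateCostUsd(9107, 122); // query product info

console.log(planAndSearch.toFixed(4)); // "0.0165"
console.log(queryInfo.toFixed(4)); // "0.0240"
console.log((planAndSearch + queryInfo).toFixed(4)); // "0.0405"
```

At these rates the two steps together cost about four cents per run, and prompt tokens (which include the screenshot) account for most of it.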

Observations on UI automation

Interest in UI automation is higher than expected; many developers were hesitant due to maintainability, but LLMs are changing that.

Lower authoring cost will draw more developers into test writing, expanding the role beyond QA.

LLM calls incur modest fees; teams unfamiliar with paid scripts may need to adjust expectations, but the ROI of sustainable automation outweighs the cost.

Open‑source models are improving, and future releases will add broader model support.

More information

GitHub: https://github.com/web-infra-dev/midscene

Homepage & docs: https://midscenejs.com/
