
How Midscene.js Uses AI to Transform UI Automation: Architecture, Workflow, and Real‑World Tips

This article systematically introduces Midscene.js, an AI‑powered next‑generation UI automation tool, covering its design motivations, core architecture, UI context acquisition, LLM‑driven planning, element verification strategies, Chrome extension implementation, common pitfalls, and practical business insights.

Alibaba Cloud Developer

Nov 4, 2025


1. Midscene.js Introduction

Midscene.js is an AI‑based UI automation framework. It addresses the fragility of traditional tools that rely on CSS selectors, XPath, or IDs: scripts written that way break when pages change, are costly to maintain, and offer no visual debugging.

1.1 Why Traditional UI Automation Fails

Selector brittleness: CSS/XPath selectors become invalid when the page changes.

High maintenance cost: Scripts need frequent rewrites to cope with dynamic content, animations, and asynchronous flows.

Poor debugging experience: It is hard to locate the failing step and understand the error.

Limited cross‑platform support: Many tools only work on desktop browsers.

1.2 AI‑Driven Solution

With advances in vision‑language (VL) models, Midscene.js lets users describe actions in natural language. The AI interprets the page content and the user's intent, maps each target to a stable element ID, and executes the action without fragile selectors.

2. Chrome Extension Trial

The extension provides a zero‑code way to try Midscene.js. Users input instructions like “search for ‘slippers’ on Taobao and press Enter.” The extension then orchestrates the automation flow.

3. Source Code Analysis

3.1 Project Structure

The repository is a pnpm monorepo split into three groups: published packages, which are released to npm; internal packages, which are for internal use only; and application packages, which include the Chrome extension.

3.2 Working Principle

3.2.1 UI Context Acquisition

Before any action, Midscene.js gathers a full UI context:

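// Capture a screenshot of the page (Base64-encoded)
const screenshot = await agent.screenshotBase64();
// Extract the DOM tree
const domTree = await agent.getElementsNodeTree();
// Read the page size
const pageSize = await agent.getPageSize();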

The screenshot is captured via Chrome DevTools Protocol, and the DOM tree is extracted by injecting a script that traverses the document, generates a stable hash ID for each element, and caches the mapping.
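The ID generation and caching step can be illustrated with a small sketch. This is not the Midscene.js implementation: the NodeDescriptor shape, the choice of an FNV‑1a‑style hash, and the names stableId and idCache are all assumptions made for illustration. The point is only that hashing an element's own properties yields the same ID on every extraction, as long as those properties do not change.

```typescript
// Hypothetical descriptor for one extracted DOM node (field names are assumptions).
interface NodeDescriptor {
  tag: string;
  text: string;
  rect: { left: number; top: number; width: number; height: number };
}

// FNV-1a-style 32-bit hash over UTF-16 code units, rendered in base 36 for a short ID.
function fnv1a(input: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(36);
}

// Cache from ID to descriptor so a later step can resolve an ID back to its node.
const idCache = new Map<string, NodeDescriptor>();

// Derive the ID from the node's tag, text, and geometry, then remember the mapping.
function stableId(node: NodeDescriptor): string {
  const { left, top, width, height } = node.rect;
  const key = [node.tag, node.text, left, top, width, height].join("|");
  const id = fnv1a(key);
  idCache.set(id, node);
  return id;
}
```

One consequence of this design is that an ID stays valid only while the hashed properties stay the same; if the element moves or its text changes, a new ID is produced and any cached mapping for the old one becomes stale.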

3.2.2 Planning with LLM (First AI Call)

The user instruction, UI context, and available action space are sent to an LLM (e.g., GPT‑4o) with a system prompt that defines the role, goals, constraints, and supported actions. The model returns a JSON plan containing actions, element IDs (e.g., "mofkb" for the search box), and a log.

{
  "actions": [
    {"type":"Input","param":{"value":"slippers","locate":{"id":"mofkb","prompt":"search box"}}},
    {"type":"KeyboardPress","param":{"keyName":"Enter"}}
  ],
  "more_actions_needed_by_instruction": false
}
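Because the plan arrives as model‑generated JSON, it is worth checking its shape before acting on it. The sketch below is a minimal validator, assuming only the fields visible in the sample plan above; the interface names and the parsePlan function are hypothetical, and the real schema may carry more fields.

```typescript
// Shapes inferred from the sample plan; these are assumptions, not the library's types.
interface Locate {
  id: string;
  prompt: string;
}

interface PlanAction {
  type: string;
  param: { value?: string; keyName?: string; locate?: Locate };
}

interface Plan {
  actions: PlanAction[];
  more_actions_needed_by_instruction: boolean;
}

// Parse the raw model output and reject anything that lacks the expected top-level fields.
function parsePlan(raw: string): Plan {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("plan must be a JSON object");
  }
  const plan = data as Partial<Plan>;
  if (!Array.isArray(plan.actions)) {
    throw new Error("plan.actions must be an array");
  }
  if (typeof plan.more_actions_needed_by_instruction !== "boolean") {
    throw new Error("plan.more_actions_needed_by_instruction must be a boolean");
  }
  for (const action of plan.actions) {
    if (typeof action?.type !== "string") {
      throw new Error("every action needs a string type");
    }
  }
  return plan as Plan;
}
```

Failing fast here keeps a malformed model response from reaching the execution layer, where the resulting error would be harder to trace back to the planning call.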

3.2.3 Element Verification (Four‑Level Strategy)

When executing an action, Midscene.js validates the target element using the following priority:

XPath verification (highest priority).

Cache verification (previously stored mappings).

Plan result verification (ID from the first LLM response).

AI fallback (second LLM call to locate the element visually).

If all methods fail, an error is thrown.
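The priority order above amounts to a fallback chain: try each strategy in turn, stop at the first one that yields an element, and throw if none does. A minimal sketch of that control flow follows. The Locator type and locateWithFallback function are hypothetical names, and each strategy is reduced to an async function that returns an element ID or null.

```typescript
// One verification strategy: a label plus an async lookup that returns an ID or null.
// (Hypothetical shape; real strategies would consult XPath, the cache, the plan, or an LLM.)
type Locator = {
  name: string;
  locate: (prompt: string) => Promise<string | null>;
};

interface LocatedElement {
  id: string;
  source: string; // name of the strategy that succeeded
}

// Try the locators in the order given; the first non-null result wins.
async function locateWithFallback(
  prompt: string,
  locators: Locator[],
): Promise<LocatedElement> {
  for (const locator of locators) {
    const id = await locator.locate(prompt);
    if (id !== null) {
      return { id, source: locator.name };
    }
  }
  throw new Error(`Unable to locate element: ${prompt}`);
}
```

Passing the strategies as an ordered array keeps the expensive AI fallback last, so the second LLM call is made only when the cheaper checks have all come back empty.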

3.2.4 Second AI Call for Verification

The fallback call sends the element description and UI context to a visual LLM, which returns the element ID and bounding box. This ensures accurate targeting even when the first plan lacks precise location data.
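Once a bounding box is available, a pointer action needs a single point to target. The sketch below takes the center of the box. The LocateResult shape (left, top, width, height) is an assumption; the article does not specify the format the model actually returns.

```typescript
// Hypothetical shape of the fallback result: an element ID plus its bounding box.
interface LocateResult {
  id: string;
  bbox: { left: number; top: number; width: number; height: number };
}

// Use the center of the bounding box as the click target, rejecting empty boxes.
function clickPoint(result: LocateResult): { x: number; y: number } {
  const { left, top, width, height } = result.bbox;
  if (width <= 0 || height <= 0) {
    throw new Error(`Element ${result.id} has an empty bounding box`);
  }
  return {
    x: Math.round(left + width / 2),
    y: Math.round(top + height / 2),
  };
}
```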

3.2.5 Action Execution

Validated actions are performed via Chrome DevTools commands:

// Clear input field
await page.clearInput(element);
// Type value
await page.keyboard.type('slippers');
// Press Enter
await page.keyboard.press({key:'Enter'});

Mouse clicks, key events, and other interactions are wrapped in helper functions that hide the cursor, dispatch CDP events, and handle asynchronous waits.
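As an illustration of what such a helper might look like, the sketch below issues a left click through the Chrome DevTools Protocol method Input.dispatchMouseEvent. The CdpSend type and the click function are hypothetical; only the protocol method name and its event types (mouseMoved, mousePressed, mouseReleased) come from CDP itself. Cursor handling and waits are left out.

```typescript
// Minimal stand-in for a CDP session: sends one protocol method with its parameters.
type CdpSend = (method: string, params: Record<string, unknown>) => Promise<void>;

// Move to the point, then press and release the left button there.
async function click(send: CdpSend, x: number, y: number): Promise<void> {
  const button = { x, y, button: "left", clickCount: 1 };
  await send("Input.dispatchMouseEvent", { type: "mouseMoved", x, y });
  await send("Input.dispatchMouseEvent", { ...button, type: "mousePressed" });
  await send("Input.dispatchMouseEvent", { ...button, type: "mouseReleased" });
}
```

Taking the send function as a parameter keeps the helper independent of how the session is obtained. In a Chrome extension it could plausibly be backed by chrome.debugger.sendCommand, and in a test it can be replaced by a recorder that captures the events.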

4. Issues Encountered During Use

Missing style‑based images: Elements that use background-image lose their URL after DOM extraction because styles are filtered out. Adjust the extractor to keep background-image values.

Context truncation: Very large DOM trees are truncated to keep token usage low, which can drop needed information. The truncation length can be tuned via configuration.

Iframe limitations with LLM: When the target element resides inside an iframe, the LLM‑based approach may fail to locate it; switching to a VL model can help.

VL model visibility issues: Elements outside the viewport may not be located reliably; using an LLM‑based approach with full DOM data is more robust.

VL model activation: Ensure both MIDSCENE_MODEL_NAME and MIDSCENE_USE_QWEN_VL=1 are set; otherwise visual coordinates are not returned.

domIncluded='visible-only': This option includes only elements currently in the viewport, which can dramatically reduce accuracy for tasks that need hidden elements.
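The last two items both come down to what falls inside the viewport. The sketch below shows a viewport filter of the kind a visible‑only mode implies, to make concrete which elements get dropped. It is an illustration under assumptions, not the library's code: the Rect and Size types and the visibleOnly function are hypothetical, and rectangles are taken to be relative to the viewport's top‑left corner.

```typescript
interface Rect {
  left: number;
  top: number;
  width: number;
  height: number;
}

interface Size {
  width: number;
  height: number;
}

// True when any part of the rectangle overlaps the viewport.
function intersectsViewport(rect: Rect, viewport: Size): boolean {
  return (
    rect.left < viewport.width &&
    rect.left + rect.width > 0 &&
    rect.top < viewport.height &&
    rect.top + rect.height > 0
  );
}

// Keep only the elements that are at least partly on screen.
function visibleOnly<T extends { rect: Rect }>(elements: T[], viewport: Size): T[] {
  return elements.filter((element) => intersectsViewport(element.rect, viewport));
}
```

With an 800‑pixel‑tall viewport, an element whose top edge sits at 2000 pixels is filtered out entirely, so an instruction that refers to it has nothing to match until the page is scrolled.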

5. Business‑Level Reflections

Midscene.js serves as an AI‑driven browser agent that can be leveraged for onboarding, intelligent marketing, automated product publishing, and AI‑assisted coding. By combining natural‑language prompts with robust verification and execution layers, it opens possibilities for low‑code automation solutions across e‑commerce, content creation, and developer tooling.

