Introducing Midscene.js: An AI‑Powered UI Automation Framework with Deep‑Think Capability

Midscene.js, the Web Infra team's AI × UI automation framework, adds Instant Actions for faster, more reliable UI operations and a Deep‑Think option that improves element localization by focusing LLM searches, with concrete code examples and model compatibility notes.


Midscene.js adds Instant Actions and Deep Think

Midscene.js is an AI × UI automation framework released by the Web Infra team. Starting with v0.14.0, it introduces two new features: Instant Actions, which make interactions more stable, and Deep Think, which improves element-localization accuracy.

Instant Actions – Faster, more reliable UI operations

The original .ai interface first asks an LLM to plan steps and then executes them, which can be unpredictable when prompts are complex. To address this, Midscene.js provides direct-action APIs: aiTap(), aiHover(), aiInput(), aiKeyboardPress(), and aiScroll(). With these calls, the LLM handles only the low-level task of locating the target element, and the action itself runs immediately.

await agent.aiInput('Headphones', 'search box');
await agent.aiKeyboardPress('Enter');
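The remaining instant actions follow the same call shape. Below is a minimal sketch of a longer flow; the `agent` object, the natural-language locator prompts, and the aiScroll parameter shape are assumptions for illustration — only the method names come from the API described above.

```javascript
// Sketch of a search flow built from instant actions. Assumes an
// already-initialized Midscene agent (e.g. one attached to a browser page);
// the prompts and the aiScroll options object are illustrative.
async function searchAndOpenFirstResult(agent, keyword) {
  await agent.aiInput(keyword, 'search box');   // locate the input, then type
  await agent.aiKeyboardPress('Enter');         // submit the query
  await agent.aiScroll({ direction: 'down' });  // bring results into view
  await agent.aiTap('first item in the result list'); // open the top result
}
```

Because each call performs exactly one action, a failure points directly at the step that went wrong, instead of at an opaque multi-step plan.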

Compared with the original

agent.ai('Type "Headphones" into the search box, then press Enter')

call, the new script removes the planning phase, as shown in the execution report (image omitted). Although the script appears more verbose, it saves time when the desired actions are clear.

Deep Think – More accurate element localization

When interacting with complex UI controls, LLMs may struggle to pinpoint the target element. The deepThink option can be added to any instant‑action call to enable a two‑step search: first locate a region containing the target, then focus on that region for a finer search, yielding more precise coordinates.

await agent.aiTap('target', { deepThink: true });
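Conceptually, the two-step search is a coarse pass followed by a zoomed-in pass. The sketch below is not Midscene's actual implementation; `locateRegion`, `locatePoint`, and the crop object are hypothetical stand-ins for the underlying model calls:

```javascript
// Conceptual illustration of a two-step ("deep think") search. None of these
// helpers exist in Midscene.js; they stand in for the underlying model calls.
async function locateWithDeepThink(model, screenshot, prompt) {
  // Pass 1: coarse search — ask for a region that contains the target.
  const region = await model.locateRegion(screenshot, prompt);
  // Pass 2: fine search — crop to that region so the target occupies more of
  // the image, then ask for precise coordinates within the crop.
  const crop = { image: screenshot, left: region.left, top: region.top,
                 width: region.width, height: region.height };
  const local = await model.locatePoint(crop, prompt);
  // Map crop-local coordinates back into full-screenshot space.
  return { x: region.left + local.x, y: region.top + local.y };
}
```

The payoff of the second pass is that a small control fills far more of the cropped image, so the model has more pixels per element to work with when it picks the final coordinates.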

For example, on Coze.com’s workflow editor page, many custom icons in the sidebar are hard for an LLM to distinguish. Using deepThink in the YAML script (or JavaScript) allows Midscene to correctly identify each target element, as demonstrated in the accompanying screenshots.
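In a YAML script, deepThink rides along with the action item. The fragment below is only an illustrative shape — the field names, prompt, and task layout are assumptions, not copied from a working Coze.com script; check the Midscene documentation for the current schema:

```yaml
# Illustrative flow fragment; schema details and prompts are assumptions.
tasks:
  - name: open a sidebar tool
    flow:
      - aiTap: 'the plugin icon in the left sidebar'
        deepThink: true
```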

Deep Think only works with vision‑capable models such as qwen2.5‑vl; it has no effect with models like gpt‑4o.

References

[1] Prompt‑writing tips: https://midscenejs.com/zh/prompting-tips.html

[2] Midscene.js website: https://midscenejs.com/zh

[3] GitHub repository: https://github.com/web-infra-dev/midscene


Tags: LLM, UI testing, AI automation, Web UI, Deep Think, Instant Actions, Midscene.js
Written by

Full-Stack Cultivation Path

Focused on sharing practical tech content about TypeScript, Vue 3, front-end architecture, and source code analysis.
