Introducing Midscene.js: An AI‑Powered UI Automation Framework with Deep‑Think Capability
Midscene.js, the Web Infra team's AI × UI automation framework, adds Instant Actions for faster, more reliable UI operations and a Deep Think option that improves element localization by narrowing the LLM's search to a smaller region. This article covers both features, with code examples and model compatibility notes.
Midscene.js adds Instant Actions and Deep Think
Midscene.js is an AI × UI automation framework released by the Web Infra team. Starting with v0.14.0, it introduces two new features: Instant Actions, which make interactions more stable, and Deep Think, which improves the accuracy of element localization.
Instant Actions – Faster, more reliable UI operations
The original .ai interface plans steps via an LLM and then executes them, which can be unpredictable when prompts are complex. To address this, Midscene.js provides direct‑action APIs such as aiTap(), aiHover(), aiInput(), aiKeyboardPress() and aiScroll(). These calls let the LLM handle only low‑level tasks like element locating, while the actions themselves are performed immediately.
await agent.aiInput('Headphones', 'search box');
await agent.aiKeyboardPress('Enter');
Compared with the original call
agent.ai('Type "Headphones" in the search box, then press Enter')
the new script removes the planning phase, as shown in the execution report (image omitted). Although the script appears more verbose, it saves time when the desired actions are clear.
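As a minimal sketch of how the two instant-action calls above might be wrapped for reuse: the `searchFor` helper and the recording stand-in agent are hypothetical, introduced only so the snippet runs without a browser or a model. Only `aiInput` and `aiKeyboardPress` come from the article; in a real script a Midscene agent would be passed in instead of the stub.

```javascript
// Hypothetical helper: types a keyword into the search box and submits it,
// using the two instant-action calls shown above (no planning step).
async function searchFor(agent, keyword) {
  await agent.aiInput(keyword, 'search box');
  await agent.aiKeyboardPress('Enter');
}

// Stand-in agent (an assumption, not part of Midscene) that only records
// which calls were made, so the flow can be checked without a real page.
function createRecordingAgent() {
  const calls = [];
  return {
    calls,
    async aiInput(text, locator) {
      calls.push(['aiInput', text, locator]);
    },
    async aiKeyboardPress(key) {
      calls.push(['aiKeyboardPress', key]);
    },
  };
}
```

Running `searchFor` against the recording agent yields exactly two recorded calls, in the order written. That one-to-one mapping between script lines and executed actions is what makes this style predictable.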
Deep Think – More accurate element localization
When interacting with complex UI controls, LLMs may struggle to pinpoint the target element. The deepThink option can be added to any instant‑action call to enable a two‑step search: first locate a region containing the target, then focus on that region for a finer search, yielding more precise coordinates.
await agent.aiTap('target', { deepThink: true });
For example, on Coze.com’s workflow editor page, many custom icons in the sidebar are hard for an LLM to distinguish. Using deepThink in the YAML script (or JavaScript) allows Midscene to correctly identify each target element, as demonstrated in the accompanying screenshots.
Deep Think only works with models that support visual grounding, such as qwen2.5-vl; it has no effect with general-purpose LLMs like gpt-4o.
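The two-step search described in this section can be sketched roughly as follows. This illustrates the idea only and is not Midscene's actual implementation: `locateRegion` and `locateInRegion` are hypothetical callbacks standing in for the two model queries, and the rectangle and point shapes are assumptions. The step worth noting is the last one, where a point found inside the cropped region is translated back into page coordinates.

```javascript
// Translate a point found inside a cropped region back to page coordinates.
function toPageCoords(region, point) {
  return { x: region.left + point.x, y: region.top + point.y };
}

// Two-step locate: coarse region first, then a finer search inside it.
// Both callbacks are hypothetical stand-ins for model queries.
async function deepLocate(prompt, locateRegion, locateInRegion) {
  // Step 1: ask for a region ({ left, top, width, height }) containing the target.
  const region = await locateRegion(prompt);
  // Step 2: search only inside that region; the result is relative to it.
  const point = await locateInRegion(prompt, region);
  if (point.x < 0 || point.y < 0 || point.x > region.width || point.y > region.height) {
    throw new Error('located point falls outside the searched region');
  }
  return toPageCoords(region, point);
}
```

For instance, if the first step returns a region whose top-left corner is at (100, 200) and the second step finds the target at (25, 20) within it, the final page coordinate is (125, 220). Searching a smaller crop gives the model fewer distractors, which is why the refined coordinates tend to be more precise.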
References
[1] Prompt‑writing tips: https://midscenejs.com/zh/prompting-tips.html
[2] Midscene.js website: https://midscenejs.com/zh
[3] GitHub repository: https://github.com/web-infra-dev/midscene
Full-Stack Cultivation Path
Focused on sharing practical tech content about TypeScript, Vue 3, front-end architecture, and source code analysis.
