Midscene.js Integrates Qwen‑2.5‑VL Model: Cost‑Effective, High‑Resolution UI Automation
Midscene.js v0.12 adds support for the Qwen‑2.5‑VL model, matching GPT‑4o's accuracy while cutting running costs by more than 80%, enabling interaction with canvas and iframe elements, offering high‑resolution input, and providing easy configuration through environment variables and a browser plugin.
As of Midscene v0.12, the UI‑automation framework supports the Qwen‑2.5‑VL model, offering the same correctness as GPT‑4o at over 80% lower running cost, a milestone for AI‑driven UI automation.
After the open‑source release of Midscene.js, users reported several pain points: the high latency and cost of GPT‑4o, the inability to interact with canvas/iframe elements, data‑security concerns, the deployment difficulty of UI‑TARS, and a desire for pure step‑driven capabilities. Integrating Qwen‑2.5‑VL addresses most of these issues.
Demo – How to integrate and compare results
Using the largest‑parameter Qwen‑2.5‑VL model on the Alibaba Cloud Bailian platform together with the Midscene sample project (see puppeteer-demo/demo.ts ), the script performs a search on eBay. The key code is:
// Execute search operation
await agent.aiAction('type "Headphones" in search box, hit Enter');

Run the script with npm run test and the generated report shows the "qwen-vl mode" label, confirming the model is active.
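For context, a full run of the demo looks roughly like the following. The repository path and script names are assumptions based on the sample-project reference above; check the project's README for the exact layout.

```shell
# Clone the Midscene sample projects (exact repo path is an assumption)
git clone https://github.com/web-infra-dev/midscene-example.git
cd midscene-example/puppeteer-demo

# Install dependencies, including @midscene/web
npm install

# Point Midscene at the Qwen-2.5-VL service
export OPENAI_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
export OPENAI_API_KEY="sk-...."
export MIDSCENE_MODEL_NAME="qwen-vl-max-latest"
export MIDSCENE_USE_QWEN_VL=1

# Run the eBay search demo and generate the report
npm run test
```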
The token‑consumption comparison shows identical functional results, with a 35% reduction in input tokens and an 89% drop in per‑run cost for Qwen‑2.5‑VL versus GPT‑4o.
How to enable the new model
Deploy Qwen‑2.5‑VL or enable the service on Alibaba Cloud and obtain an API key.
Upgrade Midscene.js to v0.12.0 or later:
npm i @midscene/web@latest

Set the environment switch:

MIDSCENE_USE_QWEN_VL=1

Configure additional variables (example):
OPENAI_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
OPENAI_API_KEY="sk-...."
MIDSCENE_MODEL_NAME="qwen-vl-max-latest"
MIDSCENE_USE_QWEN_VL=1

Enable real‑time profiling with:

MIDSCENE_DEBUG_AI_PROFILE=1

All settings can also be toggled via the Midscene.js browser plugin.
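To make the switches concrete, here is a small hypothetical helper that mirrors how these environment variables combine into a model configuration. The function and interface names are illustrative only and are not part of the Midscene API.

```typescript
// Hypothetical config resolver illustrating the environment variables above.
interface ModelConfig {
  baseURL: string;
  apiKey: string;
  modelName: string;
  useQwenVL: boolean;
  debugProfile: boolean;
}

function resolveModelConfig(env: Record<string, string | undefined>): ModelConfig {
  const baseURL = env.OPENAI_BASE_URL;
  const apiKey = env.OPENAI_API_KEY;
  if (!baseURL || !apiKey) {
    throw new Error('OPENAI_BASE_URL and OPENAI_API_KEY must be set');
  }
  return {
    baseURL,
    apiKey,
    // Fall back to a default model name when none is given
    modelName: env.MIDSCENE_MODEL_NAME ?? 'gpt-4o',
    // "1" enables Qwen-2.5-VL mode; anything else leaves it off
    useQwenVL: env.MIDSCENE_USE_QWEN_VL === '1',
    debugProfile: env.MIDSCENE_DEBUG_AI_PROFILE === '1',
  };
}

// Example: the configuration from the snippet above
const config = resolveModelConfig({
  OPENAI_BASE_URL: 'https://dashscope.aliyuncs.com/compatible-mode/v1',
  OPENAI_API_KEY: 'sk-....',
  MIDSCENE_MODEL_NAME: 'qwen-vl-max-latest',
  MIDSCENE_USE_QWEN_VL: '1',
});
console.log(config.useQwenVL, config.modelName); // true qwen-vl-max-latest
```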
Features of Qwen‑2.5‑VL
Coordinate‑based recognition decouples the model from the DOM, allowing interaction with canvas, iframe, and other elements that DOM‑based approaches cannot reach.
30‑50% token savings compared to GPT‑4o, with higher‑resolution input support.
Open‑source model – can be self‑hosted for better security and performance.
Available as an API service on Alibaba Cloud Bailian.
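The first point is worth unpacking: because the model reports a pixel position on the screenshot it was shown rather than a DOM selector, the framework only needs to map that position back to the live viewport before clicking. The sketch below is a hypothetical illustration of such a mapping, not Midscene's actual implementation.

```typescript
// Hypothetical coordinate mapping: the model reports a click point on the
// screenshot it was shown; we scale it back to the live viewport. No DOM
// query is involved, which is why canvas and iframe content stay reachable.
interface Point { x: number; y: number; }
interface Size { width: number; height: number; }

function toViewportPoint(modelPoint: Point, screenshot: Size, viewport: Size): Point {
  return {
    x: (modelPoint.x / screenshot.width) * viewport.width,
    y: (modelPoint.y / screenshot.height) * viewport.height,
  };
}

// Example: the model saw a 1000x800 screenshot of a 1280x1024 viewport
const click = toViewportPoint(
  { x: 500, y: 400 },
  { width: 1000, height: 800 },
  { width: 1280, height: 1024 },
);
console.log(click); // { x: 640, y: 512 }
```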
Limitations
Small‑icon recognition is weaker than GPT‑4o.
Assertion capability is generally lower than GPT‑4o.
Cache functionality is not yet supported.
FAQ
Which model should I start with? Use the Midscene.js browser plugin for the quickest experience, then evaluate other models as needed.
Do I need to change my existing Qwen‑VL configuration? Upgrade to the latest Midscene version and enable MIDSCENE_USE_QWEN_VL=1 to benefit from the new optimizations.
Model comparison
Midscene now supports three main models: GPT‑4o, Qwen‑2.5‑VL, and the open‑source UI‑TARS. Detailed comparisons are available on the official documentation site.
UI‑TARS
Speed: up to 5× faster than generic LLMs on GPU servers.
Native image recognition without sending DOM trees.
Open‑source, allowing self‑deployment for data privacy.
Better performance with short prompts for UI tasks.
Limitations of UI‑TARS include weaker assertion capabilities compared to GPT‑4o.
Additional resources
GitHub repository: https://github.com/web-infra-dev/midscene
Documentation and website: https://midscenejs.com/
Browser plugin for quick experience: https://midscenejs.com/zh/quick-experience.html
Model selection guide: https://midscenejs.com/zh/choose-a-model.html
ByteDance Web Infra
ByteDance Web Infra team, focused on delivering excellent technical solutions, building an open tech ecosystem, and advancing front-end technology within the company and the industry | The best way to predict the future is to create it