Midscene.js Integrates Qwen‑2.5‑VL Model: Cost‑Effective, High‑Resolution UI Automation
Midscene.js v0.12 adds support for the Qwen‑2.5‑VL model, matching GPT‑4o's accuracy while cutting running costs by more than 80%, enabling interaction with canvas and iframe elements, offering high‑resolution input, and providing easy configuration through environment variables and a browser plugin.
As of Midscene v0.12, the UI‑automation framework supports the Qwen‑2.5‑VL model, offering the same correctness as GPT‑4o at over 80% lower running cost, a milestone for AI‑driven UI automation.
After the open‑source release of Midscene.js, users reported several pain points: the high latency and cost of GPT‑4o, the inability to interact with canvas/iframe elements, data‑security concerns, the deployment difficulty of UI‑TARS, and a desire for pure step‑driven capabilities. Integrating Qwen‑2.5‑VL addresses most of these issues.
Demo – How to integrate and compare results
Using the largest‑parameter Qwen‑2.5‑VL model on the Alibaba Cloud Bailian platform together with the Midscene sample project (see puppeteer-demo/demo.ts ), the script performs a search on eBay. The key code is:
// Execute search operation
await agent.aiAction('type "Headphones" in search box, hit Enter');

Run the script with npm run test and the generated report shows the "qwen-vl mode" label, confirming the model is active.
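For context, a full run of the demo looks roughly like the following. The repository path and script names are assumptions based on the sample-project reference above; check the project's README for the exact layout.

```shell
# Clone the Midscene sample projects (exact repo path is an assumption)
git clone https://github.com/web-infra-dev/midscene-example.git
cd midscene-example/puppeteer-demo

# Install dependencies, including @midscene/web
npm install

# Point Midscene at the Qwen-2.5-VL service
export OPENAI_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
export OPENAI_API_KEY="sk-...."
export MIDSCENE_MODEL_NAME="qwen-vl-max-latest"
export MIDSCENE_USE_QWEN_VL=1

# Run the eBay search demo and generate the report
npm run test
```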
The token‑consumption comparison shows identical functional results, with a 35% reduction in input tokens and an 89% drop in per‑run cost for Qwen‑2.5‑VL versus GPT‑4o.
How to enable the new model
Deploy Qwen‑2.5‑VL or enable the service on Alibaba Cloud and obtain an API key.
Upgrade Midscene.js to v0.12.0 or later:
npm i @midscene/web@latest

Set the environment switch:

MIDSCENE_USE_QWEN_VL=1

Configure additional variables (example):
OPENAI_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
OPENAI_API_KEY="sk-...."
MIDSCENE_MODEL_NAME="qwen-vl-max-latest"
MIDSCENE_USE_QWEN_VL=1

Enable real‑time profiling with:

MIDSCENE_DEBUG_AI_PROFILE=1

All settings can also be toggled via the Midscene.js browser plugin.
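To make the switches concrete, here is a small hypothetical helper that mirrors how these environment variables combine into a model configuration. The function and interface names are illustrative only and are not part of the Midscene API.

```typescript
// Hypothetical config resolver illustrating the environment variables above.
interface ModelConfig {
  baseURL: string;
  apiKey: string;
  modelName: string;
  useQwenVL: boolean;
  debugProfile: boolean;
}

function resolveModelConfig(env: Record<string, string | undefined>): ModelConfig {
  const baseURL = env.OPENAI_BASE_URL;
  const apiKey = env.OPENAI_API_KEY;
  if (!baseURL || !apiKey) {
    throw new Error('OPENAI_BASE_URL and OPENAI_API_KEY must be set');
  }
  return {
    baseURL,
    apiKey,
    // Fall back to a default model name when none is given
    modelName: env.MIDSCENE_MODEL_NAME ?? 'gpt-4o',
    // "1" enables Qwen-2.5-VL mode; anything else leaves it off
    useQwenVL: env.MIDSCENE_USE_QWEN_VL === '1',
    debugProfile: env.MIDSCENE_DEBUG_AI_PROFILE === '1',
  };
}

// Example: the configuration from the snippet above
const config = resolveModelConfig({
  OPENAI_BASE_URL: 'https://dashscope.aliyuncs.com/compatible-mode/v1',
  OPENAI_API_KEY: 'sk-....',
  MIDSCENE_MODEL_NAME: 'qwen-vl-max-latest',
  MIDSCENE_USE_QWEN_VL: '1',
});
console.log(config.useQwenVL, config.modelName); // true qwen-vl-max-latest
```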
Features of Qwen‑2.5‑VL
Coordinate‑based recognition decouples the model from the DOM, allowing interaction with canvas, iframe, and other elements that DOM‑based approaches cannot reach.
30‑50% token savings compared to GPT‑4o, with higher‑resolution input support.
Open‑source model – can be self‑hosted for better security and performance.
Available as an API service on Alibaba Cloud Bailian.
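The first point is worth unpacking: because the model reports a pixel position on the screenshot it was shown rather than a DOM selector, the framework only needs to map that position back to the live viewport before clicking. The sketch below is a hypothetical illustration of such a mapping, not Midscene's actual implementation.

```typescript
// Hypothetical coordinate mapping: the model reports a click point on the
// screenshot it was shown; we scale it back to the live viewport. No DOM
// query is involved, which is why canvas and iframe content stay reachable.
interface Point { x: number; y: number; }
interface Size { width: number; height: number; }

function toViewportPoint(modelPoint: Point, screenshot: Size, viewport: Size): Point {
  return {
    x: (modelPoint.x / screenshot.width) * viewport.width,
    y: (modelPoint.y / screenshot.height) * viewport.height,
  };
}

// Example: the model saw a 1000x800 screenshot of a 1280x1024 viewport
const click = toViewportPoint(
  { x: 500, y: 400 },
  { width: 1000, height: 800 },
  { width: 1280, height: 1024 },
);
console.log(click); // { x: 640, y: 512 }
```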
Limitations
Small‑icon recognition is weaker than GPT‑4o.
Assertion capability is generally lower than GPT‑4o.
Cache functionality is not yet supported.
FAQ
Which model should I start with? Use the Midscene.js browser plugin for the quickest experience, then evaluate other models as needed.
Do I need to change my existing Qwen‑VL configuration? Upgrade to the latest Midscene version and enable MIDSCENE_USE_QWEN_VL=1 to benefit from the new optimizations.
Model comparison
Midscene now supports three main models: GPT‑4o, Qwen‑2.5‑VL, and the open‑source UI‑TARS. Detailed comparisons are available on the official documentation site.
UI‑TARS
Speed: up to 5× faster than generic LLMs on GPU servers.
Native image recognition without sending DOM trees.
Open‑source, allowing self‑deployment for data privacy.
Better performance with short prompts for UI tasks.
Limitations of UI‑TARS include weaker assertion capabilities compared to GPT‑4o.
Additional resources
GitHub repository: https://github.com/web-infra-dev/midscene
Documentation and website: https://midscenejs.com/
Browser plugin for quick experience: https://midscenejs.com/zh/quick-experience.html
Model selection guide: https://midscenejs.com/zh/choose-a-model.html
ByteDance Web Infra
ByteDance Web Infra team, focused on delivering excellent technical solutions, building an open tech ecosystem, and advancing front-end technology within the company and the industry | The best way to predict the future is to create it