How EvoCUA Set a New Open‑Source SOTA for Computer‑Use Agents with Evolutionary Learning
EvoCUA, a native computer‑use agent from Meituan, combines a verifiable data‑synthesis engine, a sandbox infrastructure that scales to tens of thousands of concurrent instances, and an experience‑driven learning paradigm to overcome data‑scaling and feedback challenges. It achieves a 56.7% success rate on the OSWorld benchmark, surpassing previous open‑source models.
Background and Challenges
Large language models can perceive and reason, but executing complex GUI operations (computer use) remains difficult for three main reasons: a lack of high‑quality training data, missing interactive feedback, and inefficient long‑chain exploration. Static imitation learning cannot scale: real expert trajectories are scarce, model‑generated substitutes often hallucinate, and neither provides corrective signals for long‑chain tasks.
Low‑quality data synthesis: real expert trajectories are rare, and model‑generated instructions frequently contain hallucinations that cannot be executed in a real UI.
Missing interaction feedback: static imitation tells the model what is right but not what happens when a wrong click is made, preventing it from learning causal dynamics.
Inefficient long‑chain exploration: dozens or hundreds of sequential decisions create a huge, low‑efficiency search space, and simple imitation cannot teach the model to recover from intermediate errors.
Core Technical Architecture
EvoCUA builds a closed loop of interaction‑feedback‑correction across three dimensions: data, environment, and algorithm.
Verifiable Data Synthesis Engine
The engine generates executable tasks by enforcing (1) scenario completeness (covering office documents, web retrieval, system management, etc.) and (2) execution determinism (every instruction must run successfully in a sandbox). Instead of a generate‑then‑filter pipeline, EvoCUA uses a generate‑and‑verify paradigm: while producing a natural‑language instruction it simultaneously emits Python validation code; the sandbox runs the code and only accepts data that passes.
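The generate‑and‑verify loop can be sketched roughly as follows; `task_generator`, the retry budget, and the plain subprocess standing in for the sandbox are all illustrative assumptions, not details from the report:

```python
import subprocess
import sys
import tempfile

def generate_and_verify(task_generator, max_attempts=3):
    """Accept a task only if its bundled validator actually passes;
    feed validator errors back to the generator for correction."""
    feedback = None
    for _ in range(max_attempts):
        # The generator is assumed to emit an instruction together
        # with its executable Python validation code in one shot.
        instruction, validator_code = task_generator(feedback)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(validator_code)
            path = f.name
        # A plain subprocess stands in for the real sandbox here.
        result = subprocess.run([sys.executable, path],
                                capture_output=True, text=True)
        if result.returncode == 0:
            return instruction, validator_code   # verified, accept
        feedback = result.stderr                 # back to the task architect
    return None                                  # never verified, discard
```

The key design point is that verification happens at generation time, so hallucinated, non‑executable instructions never enter the training set.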
Task space is hierarchical:
Atomic ability portability: core actions such as “data filtering” are abstracted and reused across Excel, CRM, or web back‑ends.
Complex task composition: long‑chain tasks are sequences of atomic abilities, forming a “grammar” of GUI operations.
Synthesis strategies:
Parameterized synthesis: code generators create Word/Excel files with random names, dates, prices, and similar fields.
Non‑parameterized synthesis: public, copyright‑free images, audio, and PPT slides are injected to force the agent to handle real‑world visual noise.
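A minimal sketch of parameterized synthesis, using a CSV file in place of a real Excel workbook (the product names, file layout, and instruction wording are invented for illustration):

```python
import csv
import random

def synthesize_parameterized_task(path, seed=None):
    """Emit a small spreadsheet-like file with randomized contents,
    plus the ground truth a validator can later check against."""
    rng = random.Random(seed)
    products = ["Widget", "Gadget", "Gizmo", "Sprocket"]
    rows = [(rng.choice(products), round(rng.uniform(1.0, 99.0), 2))
            for _ in range(10)]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["product", "price"])
        writer.writerows(rows)
    instruction = "Find the highest price in the sheet and note it down."
    # Because the data is generated, the ground truth is known exactly.
    ground_truth = max(price for _, price in rows)
    return instruction, ground_truth
```

Randomizing the parameters at generation time is what makes the task family scale: one generator yields arbitrarily many distinct, automatically checkable instances.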
Each generated item includes:
Instruction: a clear natural‑language command.
Validator: executable Python code and a ground‑truth file that define the success conditions (e.g., checking a cell value or a file’s existence).
The validator runs immediately in the sandbox; any error is fed back to the task architect for iterative correction until the validator succeeds.
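A validator in this scheme can be as simple as a script whose assertions encode the success conditions, so a zero exit code means success. A hypothetical example (the file name, expected cell value, and CSV stand‑in for a spreadsheet are all illustrative):

```python
import csv
import os

def validate(workdir):
    """Hard assertions encode the success conditions; any failure
    surfaces as a non-zero exit code in the sandbox."""
    report = os.path.join(workdir, "report.csv")
    # Condition 1: the agent must have created the output file.
    assert os.path.exists(report), "report.csv was not created"
    # Condition 2: a specific cell must hold the expected value.
    with open(report, newline="") as f:
        rows = list(csv.reader(f))
    assert rows[0][1] == "42", f"unexpected cell value: {rows[0][1]!r}"
```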
High‑Concurrency Sandbox Infrastructure
To support more than 100,000 daily active sandboxes and millions of minute‑level interactions, EvoCUA redesigns the simulator as a microservice‑based asynchronous system.
Async I/O gateway: a Reactor‑style non‑blocking router achieves multi‑million‑QPM throughput and decouples lifecycle management from data flow.
Rapid sandbox start/stop: a distributed scheduler shards resources and can launch 10,000+ sandbox instances within one minute.
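The core idea of a Reactor‑style gateway is that every request becomes a scheduled coroutine, so a slow sandbox never blocks the others. A toy sketch with `asyncio` (the handler and payloads are placeholders, not EvoCUA’s actual protocol):

```python
import asyncio

async def handle_request(sandbox_id, payload):
    """Stand-in for forwarding one interaction to a sandbox."""
    await asyncio.sleep(0)              # non-blocking I/O placeholder
    return sandbox_id, payload.upper()

async def gateway(requests):
    """Schedule every request concurrently; the event loop multiplexes
    them instead of handling sandboxes one at a time."""
    tasks = [asyncio.create_task(handle_request(sid, p))
             for sid, p in requests]
    return await asyncio.gather(*tasks)

# e.g. asyncio.run(gateway([("sb-1", "click"), ("sb-2", "type")]))
```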
Environment fidelity is ensured by a hybrid virtualization stack:
Outer layer: Docker containers orchestrated by Kubernetes reuse mature container operations.
Inner layer: QEMU‑KVM VMs provide strong isolation and near‑native GUI rendering.
OS‑level calibration: custom Ubuntu 22.04 images patch xkb for deterministic key mapping (e.g., fixing the loss of Shift+<) and refresh font caches (fc-cache) to eliminate rendering gaps.
Experience‑Based Learning Paradigm
The training pipeline consists of three stages:
Cold Start: inject diverse atomic‑ability patterns and a complete action space (e.g., splitting Shift+Click into key_down / key_up) to give the model a solid prior.
RFT (Rejection Sampling Fine‑Tuning): queries are bucketed into K‑levels {3, 8, 16, 32, 64}, each with a success‑rate threshold. If the model meets the threshold, sampling stops; otherwise the query escalates to a higher compute bucket. Step‑level denoising removes redundant or erroneous actions using a Judge Model, and infeasible tasks are trimmed to a final Terminate=Failure step.
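The escalating sampling budget can be sketched as follows; for simplicity this uses one shared threshold rather than the per‑level thresholds described above, and the `rollout_fn` interface is an assumption:

```python
def rft_sample(query, rollout_fn, k_levels=(3, 8, 16, 32, 64),
               threshold=0.25):
    """Try a small rollout budget first; escalate to a larger compute
    bucket only while the success rate stays below the threshold."""
    for k in k_levels:
        rollouts = [rollout_fn(query) for _ in range(k)]
        successes = [r for r in rollouts if r["success"]]
        if len(successes) / k >= threshold:
            return successes    # keep successful trajectories for SFT
    return []                   # unsolved even at the maximum budget
```

Easy queries exit at k=3, so compute concentrates on the hard tail; trajectories returned here would then pass through the step‑level denoising described above.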
RL (Reinforcement Learning with DPO): a high‑efficiency DPO algorithm focuses on the “key divergence points” where long‑chain failures first appear. Two preference styles are used:
Action correction: treat the erroneous action at the divergence point as a negative sample and replace it with a correct action from a reference trajectory or a VLM‑generated suggestion.
Reflection & recovery: label blind continuation after an error as negative and a prompted reflection chain as positive, teaching the agent to pause, observe the abnormal UI, and re‑plan.
Online RL experiments show a steady reward increase, indicating potential for fully autonomous online evolution.
Experimental Evaluation
All experiments are conducted on the OSWorld online leaderboard.
OSWorld Benchmark
Open‑source SOTA: EvoCUA‑32B achieves a 56.7% success rate, surpassing open‑source OpenCUA‑72B (45.0%) and closed‑source UI‑TARS‑2 (53.1%). Under a 50‑step inference budget, the gap to Claude‑4.5‑Sonnet (58.1%) is only 1.4 percentage points.
Small‑model advantage: EvoCUA‑8B reaches 46.1%, outperforming OpenCUA‑72B and beating Step‑GUI‑8B (40.2%) by 5.9 points.
Ablation Studies
Unified action space: +4.84%.
Cold start: +2.62%.
RFT rejection sampling: +3.13%.
Offline DPO on key divergence points: +3.21%.
Iterative training: +1.90%.
Scaling Analysis
Max steps: performance improves with more inference steps, but marginal gains diminish beyond 50 steps due to data scarcity.
Pass@k: increasing the sampling count k consistently raises success, indicating a higher performance ceiling.
Data scale: expanding the RFT data from 20 k to 1 M yields steady performance gains.
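The report does not spell out how Pass@k is computed, but the standard convention is the unbiased estimator over n rollouts of which c succeed:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n rollouts, c of which
    succeeded, is a success."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: guaranteed hit
    return 1.0 - comb(n - c, k) / comb(n, k)
```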
Trajectory Visualization
Example: spreadsheet task “find the max value in each row and fill column G”. Key steps include goal clarification, atomic Max operation via Excel formula, composite interaction using Shift+Click, and termination only after visual verification.
Summary and Outlook
Key take‑aways:
High signal‑to‑noise data is essential: successful trajectories are low‑noise but sparse; failure trajectories are noisy yet rich in information.
Pattern diversity outweighs sheer data volume: a lightweight, pattern‑rich cold start beats massive low‑quality SFT data.
On‑policy data matters: over‑reliance on off‑policy samples drifts the model away from its primary capabilities.
Visualization‑driven iteration: extensive trajectory‑visualization tools are crucial for data‑quality verification and debugging.
Open resources:
GitHub repository: https://github.com/meituan/EvoCUA
HuggingFace model hub: https://huggingface.co/meituan/EvoCUA-32B-20260105
Technical report PDF: https://github.com/meituan/EvoCUA/blob/main/tech_report.pdf
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle‑services e‑commerce platform, supporting hundreds of millions of consumers and millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.