How EvoCUA Set a New Open‑Source SOTA for Computer‑Use Agents with Evolutionary Learning
EvoCUA, a native computer‑use agent from Meituan, combines a verifiable data‑synthesis engine, a sandbox infrastructure that scales to tens of thousands of concurrent instances, and an experience‑driven learning paradigm to overcome data‑scaling and feedback challenges. It achieves a 56.7% success rate on the OSWorld benchmark, surpassing previous open‑source models.
Background and Challenges
Large language models can perceive and reason, but executing complex GUI operations (computer use) remains difficult for three main reasons: a lack of high‑quality training data, missing interactive feedback, and inefficient long‑chain exploration. Static imitation learning cannot scale: real expert trajectories are scarce, model‑generated substitutes often hallucinate, and neither provides corrective signals for long‑chain tasks.
Low‑quality data synthesis: real expert trajectories are rare, and model‑generated instructions frequently contain hallucinations that cannot be executed in a real UI.
Missing interaction feedback: static imitation tells the model what is right but not what happens when a wrong click is made, preventing it from learning causal dynamics.
Inefficient long‑chain exploration: dozens or hundreds of sequential decisions create a huge, low‑efficiency search space, and simple imitation cannot teach the model to recover from intermediate errors.
Core Technical Architecture
EvoCUA builds a closed loop of interaction‑feedback‑correction across three dimensions: data, environment, and algorithm.
Verifiable Data Synthesis Engine
The engine generates executable tasks by enforcing (1) scenario completeness (covering office documents, web retrieval, system management, etc.) and (2) execution determinism (every instruction must run successfully in a sandbox). Instead of a generate‑then‑filter pipeline, EvoCUA uses a generate‑and‑verify paradigm: while producing a natural‑language instruction it simultaneously emits Python validation code; the sandbox runs the code and only accepts data that passes.
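The generate‑and‑verify loop can be sketched roughly as follows; `task_generator`, the retry budget, and the plain subprocess standing in for the sandbox are all illustrative assumptions, not details from the report:

```python
import subprocess
import sys
import tempfile

def generate_and_verify(task_generator, max_attempts=3):
    """Accept a task only if its bundled validator actually passes;
    feed validator errors back to the generator for correction."""
    feedback = None
    for _ in range(max_attempts):
        # The generator is assumed to emit an instruction together
        # with its executable Python validation code in one shot.
        instruction, validator_code = task_generator(feedback)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(validator_code)
            path = f.name
        # A plain subprocess stands in for the real sandbox here.
        result = subprocess.run([sys.executable, path],
                                capture_output=True, text=True)
        if result.returncode == 0:
            return instruction, validator_code   # verified, accept
        feedback = result.stderr                 # back to the task architect
    return None                                  # never verified, discard
```

The key design point is that verification happens at generation time, so hallucinated, non‑executable instructions never enter the training set.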
Task space is hierarchical:
Atomic ability portability: core actions such as “data filtering” are abstracted and reused across Excel, CRM, or web back‑ends.
Complex task composition: long‑chain tasks are sequences of atomic abilities, forming a “grammar” of GUI operations.
Synthesis strategies:
Parameterized synthesis: code generators create Word/Excel files with random names, dates, prices, and similar fields.
Non‑parameterized synthesis: public, copyright‑free images, audio, and PPT slides are injected to force the agent to handle real‑world visual noise.
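A minimal sketch of parameterized synthesis, using a CSV file in place of a real Excel workbook (the product names, file layout, and instruction wording are invented for illustration):

```python
import csv
import random

def synthesize_parameterized_task(path, seed=None):
    """Emit a small spreadsheet-like file with randomized contents,
    plus the ground truth a validator can later check against."""
    rng = random.Random(seed)
    products = ["Widget", "Gadget", "Gizmo", "Sprocket"]
    rows = [(rng.choice(products), round(rng.uniform(1.0, 99.0), 2))
            for _ in range(10)]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["product", "price"])
        writer.writerows(rows)
    instruction = "Find the highest price in the sheet and note it down."
    # Because the data is generated, the ground truth is known exactly.
    ground_truth = max(price for _, price in rows)
    return instruction, ground_truth
```

Randomizing the parameters at generation time is what makes the task family scale: one generator yields arbitrarily many distinct, automatically checkable instances.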
Each generated item includes:
Instruction: a clear natural‑language command.
Validator: executable Python code and a ground‑truth file that define the success conditions (e.g., checking a cell value or a file’s existence).
The validator runs immediately in the sandbox; any error is fed back to the task architect for iterative correction until the validator succeeds.
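A validator in this scheme can be as simple as a script whose assertions encode the success conditions, so a zero exit code means success. A hypothetical example (the file name, expected cell value, and CSV stand‑in for a spreadsheet are all illustrative):

```python
import csv
import os

def validate(workdir):
    """Hard assertions encode the success conditions; any failure
    surfaces as a non-zero exit code in the sandbox."""
    report = os.path.join(workdir, "report.csv")
    # Condition 1: the agent must have created the output file.
    assert os.path.exists(report), "report.csv was not created"
    # Condition 2: a specific cell must hold the expected value.
    with open(report, newline="") as f:
        rows = list(csv.reader(f))
    assert rows[0][1] == "42", f"unexpected cell value: {rows[0][1]!r}"
```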
High‑Concurrency Sandbox Infrastructure
To support more than 100,000 daily active sandboxes and millions of minute‑level interactions, EvoCUA redesigns the simulator as a microservice‑based asynchronous system.
Async I/O gateway: a Reactor‑style non‑blocking router achieves multi‑million‑QPM throughput and decouples lifecycle management from data flow.
Rapid sandbox start/stop: a distributed scheduler shards resources and can launch 10,000+ sandbox instances within one minute.
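The core idea of a Reactor‑style gateway is that every request becomes a scheduled coroutine, so a slow sandbox never blocks the others. A toy sketch with `asyncio` (the handler and payloads are placeholders, not EvoCUA’s actual protocol):

```python
import asyncio

async def handle_request(sandbox_id, payload):
    """Stand-in for forwarding one interaction to a sandbox."""
    await asyncio.sleep(0)              # non-blocking I/O placeholder
    return sandbox_id, payload.upper()

async def gateway(requests):
    """Schedule every request concurrently; the event loop multiplexes
    them instead of handling sandboxes one at a time."""
    tasks = [asyncio.create_task(handle_request(sid, p))
             for sid, p in requests]
    return await asyncio.gather(*tasks)

# e.g. asyncio.run(gateway([("sb-1", "click"), ("sb-2", "type")]))
```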
Environment fidelity is ensured by a hybrid virtualization stack:
Outer layer: Docker containers orchestrated by Kubernetes reuse mature container operations.
Inner layer: QEMU‑KVM VMs provide strong isolation and near‑native GUI rendering.
OS‑level calibration: custom Ubuntu 22.04 images patch xkb for deterministic key mapping (e.g., fixing the loss of Shift+<) and refresh font caches (fc-cache) to eliminate rendering gaps.
Experience‑Based Learning Paradigm
The training pipeline consists of three stages:
Cold Start: inject diverse atomic‑ability patterns and a complete action space (e.g., splitting Shift+Click into key_down / key_up) to give the model a solid prior.
RFT (Rejection Sampling Fine‑Tuning): queries are bucketed into K‑levels {3, 8, 16, 32, 64}, each with a success‑rate threshold. If the model meets the threshold, sampling stops; otherwise the query escalates to a higher compute bucket. Step‑level denoising removes redundant or erroneous actions using a Judge Model, and infeasible tasks are trimmed to a final Terminate=Failure step.
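The escalating sampling budget can be sketched as follows; for simplicity this uses one shared threshold rather than the per‑level thresholds described above, and the `rollout_fn` interface is an assumption:

```python
def rft_sample(query, rollout_fn, k_levels=(3, 8, 16, 32, 64),
               threshold=0.25):
    """Try a small rollout budget first; escalate to a larger compute
    bucket only while the success rate stays below the threshold."""
    for k in k_levels:
        rollouts = [rollout_fn(query) for _ in range(k)]
        successes = [r for r in rollouts if r["success"]]
        if len(successes) / k >= threshold:
            return successes    # keep successful trajectories for SFT
    return []                   # unsolved even at the maximum budget
```

Easy queries exit at k=3, so compute concentrates on the hard tail; trajectories returned here would then pass through the step‑level denoising described above.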
RL (Reinforcement Learning with DPO): a high‑efficiency DPO algorithm focuses on the “key divergence points” where long‑chain failures first appear. Two preference styles are used:
Action correction: treat the erroneous action at the divergence point as a negative sample and replace it with a correct action from a reference trajectory or a VLM‑generated suggestion.
Reflection & recovery: label blind continuation after an error as negative and a prompted reflection chain as positive, teaching the agent to pause, observe the abnormal UI, and re‑plan.
Online RL experiments show a steady reward increase, indicating potential for fully autonomous online evolution.
Experimental Evaluation
All experiments are conducted on the OSWorld online leaderboard.
OSWorld Benchmark
Open‑source SOTA: EvoCUA‑32B achieves a 56.7% success rate, surpassing open‑source OpenCUA‑72B (45.0%) and closed‑source UI‑TARS‑2 (53.1%). Under a 50‑step inference budget, the gap to Claude‑4.5‑Sonnet (58.1%) is only 1.4 percentage points.
Small‑model advantage: EvoCUA‑8B reaches 46.1%, outperforming OpenCUA‑72B and beating Step‑GUI‑8B (40.2%) by 5.9 points.
Ablation Studies
Unified action space: +4.84%.
Cold start: +2.62%.
RFT rejection sampling: +3.13%.
Offline DPO on key divergence points: +3.21%.
Iterative training: +1.90%.
Scaling Analysis
Max steps: performance improves with more inference steps, but marginal gains diminish beyond 50 steps due to data scarcity.
Pass@k: increasing the sampling count k consistently raises success, indicating a higher performance ceiling.
Data scale: expanding the RFT data from 20 k to 1 M yields steady performance gains.
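The report does not spell out how Pass@k is computed, but the standard convention is the unbiased estimator over n rollouts of which c succeed:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n rollouts, c of which
    succeeded, is a success."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: guaranteed hit
    return 1.0 - comb(n - c, k) / comb(n, k)
```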
Trajectory Visualization
Example: spreadsheet task “find the max value in each row and fill column G”. Key steps include goal clarification, atomic Max operation via Excel formula, composite interaction using Shift+Click, and termination only after visual verification.
Summary and Outlook
Key take‑aways:
High signal‑to‑noise data is essential: successful trajectories are low‑noise but sparse; failure trajectories are noisy yet rich in information.
Pattern diversity outweighs sheer data volume: a lightweight, pattern‑rich cold start beats massive low‑quality SFT data.
On‑policy data matters: over‑reliance on off‑policy samples drifts the model away from its primary capabilities.
Visualization‑driven iteration: extensive trajectory‑visualization tools are crucial for data‑quality verification and debugging.
Open resources:
GitHub repository: https://github.com/meituan/EvoCUA
HuggingFace model hub: https://huggingface.co/meituan/EvoCUA-32B-20260105
Technical report PDF: https://github.com/meituan/EvoCUA/blob/main/tech_report.pdf
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle‑services e‑commerce platform, supporting hundreds of millions of consumers and millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.