From Search Ads to Foundation Models: My Journey Building the EvoCUA GUI Agent

The author explains why he transitioned from search advertising algorithms to foundation model research, outlines the four typical activities of base‑model teams, and shares detailed technical insights, experimental practices, and scaling strategies that led the EvoCUA GUI Agent to achieve open‑source SOTA on OSWorld.


Why I switched to foundation models

The transition was driven by a rational assessment of industry cycles and personal strengths: traditional search advertising (CTR/CVR optimization) is reaching diminishing returns, while foundation models shift the focus from distributing information to generating it and executing tasks, making them a higher‑impact technology that behaves more like a product.

What base‑model teams actually do

Surveying the industry, the author identifies four main categories of work:

Fundamental research: exploring new architectures such as novel attention mechanisms or Mixture‑of‑Experts variants; high‑risk, long‑horizon academic work.

Building general‑purpose bases: pre‑training, mid‑training, and post‑training to validate scaling laws; resource‑intensive with clear performance targets.

Agentic model development: treating the model as a product, focusing on tool‑calling and computer‑use capabilities; requires extensive post‑training and reinforcement learning with real‑world interaction.

AI‑native applications: building AI search or assistants on top of existing bases, emphasizing product flow and daily active users (DAU).

The author chose categories 2 and 3, focusing on a GUI Agent sub‑direction because it combines multimodality, reasoning, and action, and leverages his data‑sense and large‑scale engineering background.

How EvoCUA was forged

EvoCUA is the first representative work after the career switch, built after thousands of experiments consuming over a million GPU‑hours. The core idea is to let the model learn from trial‑and‑error in real environments, accumulating both successful and failed experiences.
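
To make that trial‑and‑error loop concrete, here is a minimal Python sketch of experience accumulation. Every name below (Trajectory, ExperienceBuffer, the env/policy interfaces) is a hypothetical illustration, not the actual EvoCUA API.

```python
# Minimal sketch of the trial-and-error loop described above.
# All interfaces here are hypothetical stand-ins, not EvoCUA's real API.
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    steps: list   # (observation, action) pairs
    success: bool # verdict from the environment's task checker


@dataclass
class ExperienceBuffer:
    successes: list = field(default_factory=list)
    failures: list = field(default_factory=list)

    def add(self, traj: Trajectory) -> None:
        # Keep both outcomes: successes become imitation targets,
        # failures become contrastive / analysis material.
        (self.successes if traj.success else self.failures).append(traj)


def collect(env, policy, buffer: ExperienceBuffer, episodes: int) -> None:
    for _ in range(episodes):
        steps, obs = [], env.reset()
        done = False
        while not done:
            action = policy.act(obs)
            obs, done = env.step(action)
            steps.append((obs, action))
        buffer.add(Trajectory(steps=steps, success=env.task_succeeded()))
```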

1. Global insight (planning)

Before building, the team defined a clear goal: achieve open‑source SOTA on the OSWorld benchmark. They first measured the current SOTA, identified gaps, and aligned evaluation metrics before any optimization.

Metric alignment pitfalls: many bugs were fixed to ensure the evaluation metric was reliable.

Competitor analysis: examined Qwen3‑VL, OpenCUA, and other leading models.

Tooling: developed visualization and debugging tools to inspect successful and failed trajectories frame‑by‑frame.

Typical computer‑use problems discovered include premature success judgments and incomplete action spaces (e.g., missing shift+click support).
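
To illustrate what an incomplete action space means in practice, here is a hedged sketch of a click action that supports modifier keys such as shift+click. The schema and the controller interface are assumptions for illustration, not EvoCUA's actual action space.

```python
# Hypothetical click action with modifier-key support; missing combinations
# like shift+click were one of the gaps described above.
from dataclasses import dataclass, field
from typing import Literal


@dataclass
class Click:
    x: int
    y: int
    button: Literal["left", "right", "middle"] = "left"
    # An empty modifier list is a plain click; ["shift"] gives shift+click,
    # which e.g. extends a selection instead of replacing it.
    modifiers: list = field(default_factory=list)  # "shift", "ctrl", "alt"


def execute(click: Click, controller) -> None:
    # controller is a stand-in for whatever drives the GUI.
    for key in click.modifiers:
        controller.key_down(key)
    controller.click(click.x, click.y, button=click.button)
    for key in reversed(click.modifiers):
        controller.key_up(key)
```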

2. Solid baseline construction

Choosing a base model (e.g., OpenCUA, Qwen3‑VL, Qwen2.5‑VL) required proving that it could reliably improve scores on in‑domain data before scaling out‑of‑domain. A simple pipeline and data processing were sufficient to verify incremental gains.
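
A minimal sketch of that gating logic, assuming a generic evaluate() helper that returns a success rate; the function names and the threshold are illustrative, not taken from the report.

```python
# Only scale to out-of-domain data once the candidate reliably beats
# the baseline in-domain. evaluate() and the task list are placeholders.
def in_domain_gain(baseline_model, candidate_model, in_domain_tasks, evaluate):
    base = evaluate(baseline_model, in_domain_tasks)   # e.g. success rate
    cand = evaluate(candidate_model, in_domain_tasks)
    return cand - base


def should_scale_out_of_domain(gain: float, min_gain: float = 0.02) -> bool:
    # Require a clear in-domain improvement (threshold is illustrative)
    # before spending compute on out-of-domain scaling.
    return gain >= min_gain
```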

3. Large‑scale experimentation and scaling

With a stable baseline, the team entered massive experiment phases:

Ammo stockpile: listed every plausible technique for boosting the metric, emphasizing scaling from data to experience.

RFT challenges: step‑level trajectory learning introduced noise; the team ran ablations to clean the step‑level data (a minimal filtering sketch follows this list), achieving immediate improvements. Later, cold‑start and RL explorations complemented RFT.
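
Here is the filtering sketch referenced above: a hypothetical step‑level cleaner that drops duplicate and low‑confidence steps. The field names (observation_hash, score) and the threshold are assumptions; the paper's actual cleaning rules may differ.

```python
# Illustrative step-level filter for RFT data, assuming each step carries
# a quality signal such as an automatic checker score.
def clean_steps(trajectory: list[dict], min_score: float = 0.8) -> list[dict]:
    cleaned = []
    seen = set()
    for step in trajectory:
        key = (step["observation_hash"], step["action"])
        if key in seen:                 # drop exact duplicate steps
            continue
        if step["score"] < min_score:   # drop low-confidence / noisy steps
            continue
        seen.add(key)
        cleaned.append(step)
    return cleaned
```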

Failed experiments were treated as valuable signals. The workflow for a failed run includes revisiting the hypothesis, recording the failure, and returning to the record later for retrospective analysis. Most failures traced back to noisy data, duplicate samples, concentration imbalance, redundant operations, or parameter mis‑settings.
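
One lightweight way to implement "record the failure, revisit it later" is an append‑only JSONL log. Everything below (field names, file format) is an illustrative assumption rather than the team's actual tooling.

```python
# Sketch of a failure log for later retrospection; all fields are assumed.
import json
import time


def log_failed_run(path: str, hypothesis: str, config: dict, observed: str) -> None:
    record = {
        "timestamp": time.time(),
        "hypothesis": hypothesis,  # what the experiment was meant to show
        "config": config,          # enough detail to reproduce the run
        "observed": observed,      # what actually happened
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")


def load_failures(path: str) -> list[dict]:
    # Re-read the log later to look for recurring causes (noisy data,
    # duplicates, parameter mis-settings, ...).
    with open(path) as f:
        return [json.loads(line) for line in f]
```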

Key takeaways from six months of work

High‑signal data is crucial: low‑noise successful trajectories and high‑noise failure trajectories together drive continuous improvement.

Prior patterns outweigh raw data volume: diverse cold‑start patterns provide a stronger foundation than massive amounts of low‑quality data.

On‑policy data matters: over‑reliance on off‑policy data can drift the model away from its primary capabilities.

Visualization‑driven iteration: a full‑stack visual debugging suite is essential for data‑quality checks and trajectory comparison.

Training sensitivity: long‑chain agent tasks are fragile; even minor data duplication or parameter tweaks can cause collapse.

Environment uncertainty and Pass@k: GUI environments introduce latency and rendering variance; Pass@k measures both diversity and robustness to such noise (see the estimator sketch after this list).
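
For reference, the standard unbiased Pass@k estimator (from Chen et al.'s HumanEval paper) is computed as below; EvoCUA's exact evaluation harness may differ, but 1 − C(n−c, k)/C(n, k) is the common definition.

```python
# Unbiased Pass@k estimator: probability that at least one of k rollouts
# drawn from n samples (c of them successful) succeeds.
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """n = rollouts sampled per task, c = successes among them.

    Returns 1 - C(n-c, k) / C(n, k).
    """
    if n - c < k:
        return 1.0  # too few failures to fill k draws: guaranteed success
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 16 rollouts of a flaky GUI task, 4 succeeded.
# Pass@1 = 0.25, while Pass@8 is much higher because retries absorb
# environment noise (latency, rendering variance).
print(pass_at_k(16, 4, 1), pass_at_k(16, 4, 8))
```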

Personal reflections

The author emphasizes rebuilding one's mindset and self‑evaluation criteria: deep data insight, sensitivity to frontier research, goal orientation, and self‑driven evaluation frameworks become essential in a field where KPI feedback is delayed.

Technical resources for EvoCUA are openly available:

GitHub: https://github.com/meituan/EvoCUA

HuggingFace model: https://huggingface.co/meituan/EvoCUA-32B-20260105

Technical report (arXiv): https://arxiv.org/abs/2601.15876

Additional reading on LLM agents and RL methods is listed in the original post (links omitted for brevity).

Tags: AI research, model scaling, foundation models, experiment methodology, GUI agents
Written by Baobao Algorithm Notes, author of the BaiMian large model, offering technology and industry insights.