A New Paradigm for GUI Agent Trajectory Generation: FSM‑Synthesized Data at $0.04 per Trajectory

AutoWebWorld introduces a finite‑state‑machine‑driven pipeline that synthesizes verified web‑GUI trajectories at an average cost of only $0.04 each, producing longer interaction sequences, scaling efficiently, and demonstrably improving large‑language‑model agents on WebVoyager and grounding benchmarks.

Machine Learning Algorithms & Natural Language Processing

Problem Statement

Training a GUI agent traditionally requires collecting trajectories on real websites. Because real sites do not expose their internal state, correctness has to be inferred from screenshots, DOM diffs, manual labeling, or LLM judges, all of which are costly, unstable, and hard to scale.

Core Idea of AutoWebWorld

The authors propose turning the web from a black‑box into a verifiable interactive world by describing each environment with a finite‑state‑machine (FSM) that explicitly defines pages, state variables, action preconditions, transition effects, and goal states. When a trajectory satisfies the FSM transitions and reaches the goal state, its correctness is intrinsically verified.
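To make the idea of intrinsic verification concrete, here is a minimal Python sketch of an FSM whose actions carry preconditions and effects. The toy shop environment and every name in it (`Action`, `verify`, `add_to_cart`, and so on) are illustrative assumptions, not the authors' schema.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A state is the current page plus any state variables (assumed representation).
State = Dict[str, object]


@dataclass(frozen=True)
class Action:
    name: str
    precondition: Callable[[State], bool]  # may the action fire in this state?
    effect: Callable[[State], State]       # state produced by the transition


def verify(initial: State, actions: List[Action],
           is_goal: Callable[[State], bool]) -> bool:
    """A trajectory is valid if every precondition holds and the goal is reached."""
    state = dict(initial)
    for action in actions:
        if not action.precondition(state):
            return False
        state = action.effect(state)
    return is_goal(state)


# Hypothetical toy environment: a three-page shop.
add_to_cart = Action(
    "add_to_cart",
    precondition=lambda s: s["page"] == "product",
    effect=lambda s: {**s, "cart": s["cart"] + 1},
)
open_cart = Action(
    "open_cart",
    precondition=lambda s: s["page"] == "product" and s["cart"] > 0,
    effect=lambda s: {**s, "page": "cart"},
)
checkout = Action(
    "checkout",
    precondition=lambda s: s["page"] == "cart",
    effect=lambda s: {**s, "page": "done"},
)

INITIAL: State = {"page": "product", "cart": 0}


def is_goal(state: State) -> bool:
    return state["page"] == "done"


if __name__ == "__main__":
    print(verify(INITIAL, [add_to_cart, open_cart, checkout], is_goal))
    print(verify(INITIAL, [open_cart, checkout], is_goal))  # cart is empty
```

The second call fails at the first step because the precondition of `open_cart` is not met, which is the sense in which no external judge is needed.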

Pipeline Overview

1. FSM Generation

A closed loop of FSM Proposer, FSM Validator, and FSM Improver iteratively refines an FSM for a given web theme, yielding a final FSM that is logically consistent and aligned with user intent.
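The control flow of such a propose/validate/improve loop might look like the sketch below. In the real pipeline these roles are presumably played by LLM agents; here the validator is a simple structural check and the improver is a deterministic stand-in, and the dictionary layout of the FSM is an assumption made for illustration.

```python
from typing import Callable, Dict, List

FSM = Dict[str, object]  # assumed keys: "pages", "transitions", "goal"


def validate(fsm: FSM) -> List[str]:
    """Return a list of issues; an empty list means the draft passes."""
    pages = set(fsm["pages"])
    issues = []
    for src, action, dst in fsm["transitions"]:
        for page in (src, dst):
            if page not in pages:
                issues.append(f"{action}: undefined page {page!r}")
    if fsm["goal"] not in pages:
        issues.append(f"goal page {fsm['goal']!r} is undefined")
    return issues


def refine(draft: FSM, improve: Callable[[FSM, List[str]], FSM],
           max_rounds: int = 5) -> FSM:
    """Alternate validation and improvement until no issues remain."""
    fsm = draft
    for _ in range(max_rounds):
        issues = validate(fsm)
        if not issues:
            return fsm
        fsm = improve(fsm, issues)
    raise ValueError("FSM did not pass validation within max_rounds")


def add_missing_pages(fsm: FSM, issues: List[str]) -> FSM:
    """Stand-in improver (hypothetical): declare every page that is referenced."""
    pages = list(fsm["pages"])
    referenced = [p for src, _, dst in fsm["transitions"] for p in (src, dst)]
    for page in referenced + [fsm["goal"]]:
        if page not in pages:
            pages.append(page)
    return {**fsm, "pages": pages}


if __name__ == "__main__":
    draft = {
        "pages": ["home"],
        "transitions": [("home", "search", "results")],
        "goal": "results",
    }
    print(refine(draft, add_missing_pages)["pages"])
```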

2. Web Synthesis

The final FSM is handed to a Coding Agent, which generates the front‑end files (e.g., style.txt, todo.md, fsm.js, data.js) and builds a runnable synthetic website. If errors occur, a Self‑Repair loop automatically fixes and restarts the build.

3. Trajectory Search

Using BFS on the FSM transition graph, the system searches from the initial state to the goal state, producing candidate trajectories whose actions are guaranteed to be executable by the precondition rules.
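As a rough illustration, breadth-first search over a transition graph can be written as follows. The graph encoding (a dictionary from state to `(action, next_state)` pairs) and the toy shop states are assumptions, and this sketch returns only one shortest action sequence, whereas the pipeline described here produces a set of candidates.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# state -> list of (action name, next state); assumed encoding of the FSM graph
Graph = Dict[str, List[Tuple[str, str]]]


def bfs_trajectory(transitions: Graph, start: str, goal: str) -> Optional[List[str]]:
    """Return a shortest list of actions from start to goal, or None."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        for action, nxt in transitions.get(state, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [action]))
    return None


# Hypothetical toy environment.
SHOP: Graph = {
    "home": [("click_search", "results"), ("click_login", "login")],
    "login": [("submit_login", "home")],
    "results": [("open_item", "item")],
    "item": [("add_to_cart", "cart")],
    "cart": [("checkout", "done")],
}

if __name__ == "__main__":
    print(bfs_trajectory(SHOP, "home", "done"))
```

Because only edges present in the graph are followed, every action in the returned path is executable from the state in which it is taken.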

4. Automatic Trajectory Selection

Candidate trajectories are replayed on the synthesized website via Playwright; the system checks element existence, action success, page navigation, and goal‑state attainment, retaining only fully verified trajectories.
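A replay check of this kind could be sketched with Playwright's synchronous Python API as below. The step format (`action`, `selector`, optional `value` and `expect_url`) and the idea of checking the goal with a JavaScript expression are assumptions for illustration; the source does not describe how the synthetic site exposes its state.

```python
from typing import Dict, List


def replay(page, steps: List[Dict[str, str]], goal_check_js: str) -> bool:
    """Replay steps on an open Playwright page; True only if all checks pass."""
    for step in steps:
        target = page.locator(step["selector"])
        if target.count() == 0:          # element existence
            return False
        try:                             # action success
            if step["action"] == "click":
                target.first.click()
            elif step["action"] == "fill":
                target.first.fill(step["value"])
            else:
                return False             # unsupported action type
        except Exception:
            return False
        # page navigation (optional, hypothetical field)
        if "expect_url" in step and step["expect_url"] not in page.url:
            return False
    # goal-state attainment, evaluated inside the page
    return bool(page.evaluate(goal_check_js))


def replay_in_browser(url: str, steps: List[Dict[str, str]],
                      goal_check_js: str) -> bool:
    """Open the synthesized site in headless Chromium and replay the steps."""
    from playwright.sync_api import sync_playwright  # requires playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            page.goto(url)
            return replay(page, steps, goal_check_js)
        finally:
            browser.close()
```

Keeping `replay` separate from browser setup means the filtering logic can be exercised with a stub page object, without launching a browser.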

Data Scale and Cost

AutoWebWorld synthesized 29 distinct web environments covering 875 pages and generated 11,663 verified trajectories. The average trajectory length is 21.9 steps, significantly longer than in real‑web datasets (6.9–12.1 steps). The average generation cost per trajectory is about $0.04, compared with $0.15–$1.00 for existing datasets. Total construction cost for the 29 environments is $447.37, broken down into Web Generation ($52.26), FSM Generation ($57.10), Query Generation ($65.84), and Thinking Generation ($272.17).
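The headline figure follows directly from the numbers above. The short check below sums the four cost components and divides by the trajectory count; only the variable names are introduced here.

```python
# Cost components in USD, as reported for the 29 environments.
costs = {
    "Web Generation": 52.26,
    "FSM Generation": 57.10,
    "Query Generation": 65.84,
    "Thinking Generation": 272.17,
}
num_trajectories = 11_663
num_environments = 29

total = round(sum(costs.values()), 2)
per_trajectory = total / num_trajectories
per_environment = total / num_environments
thinking_share = costs["Thinking Generation"] / total

if __name__ == "__main__":
    print(f"total: ${total:.2f}")
    print(f"per trajectory: ${per_trajectory:.4f}")
    print(f"per environment: ${per_environment:.2f}")
    print(f"thinking share: {thinking_share:.1%}")
```

The components add up to the stated $447.37, which works out to roughly $0.038 per trajectory, with Thinking Generation accounting for about 61% of the total.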

Training Procedure

From the 11,663 trajectories, 1,215 distinct paths are sampled to avoid overfitting to homogeneous transition patterns, yielding 12,585 interaction steps. Single‑step interactions are also extracted and rewritten as grounding supervision, so the final training set contains roughly 16k GRPO steps.
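One plausible way to pick distinct paths and derive grounding samples is sketched below. The source does not say how path distinctness is defined, so deduplicating on the sequence of `(page, action)` pairs is an assumption, as are all field names (`page`, `action`, `description`, `target_xy`).

```python
from typing import Dict, List

Step = Dict[str, object]
Trajectory = List[Step]


def sample_distinct_paths(trajectories: List[Trajectory]) -> List[Trajectory]:
    """Keep the first trajectory seen for each (page, action) sequence."""
    seen = set()
    kept = []
    for traj in trajectories:
        signature = tuple((s["page"], s["action"]) for s in traj)
        if signature not in seen:
            seen.add(signature)
            kept.append(traj)
    return kept


def to_grounding_samples(trajectories: List[Trajectory]) -> List[Dict[str, object]]:
    """Rewrite each single step as an instruction -> target-location sample."""
    return [
        {"instruction": s["description"], "target_xy": s["target_xy"]}
        for traj in trajectories
        for s in traj
    ]


if __name__ == "__main__":
    step_a = {"page": "home", "action": "click_search",
              "description": "Click the search button", "target_xy": (120, 40)}
    step_b = {"page": "results", "action": "open_item",
              "description": "Open the first result", "target_xy": (300, 220)}
    paths = sample_distinct_paths([[step_a, step_b], [step_a, step_b], [step_a]])
    print(len(paths), len(to_grounding_samples(paths)))
```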

Experimental Results

Evaluated on the WebVoyager benchmark (using Gemini‑3‑Flash as the judge), a 7B model trained on AutoWebWorld data achieved a 27.42% overall success rate, surpassing UI‑TARS‑1.5‑7B (26.51%) and the original Qwen2.5‑VL‑7B (5.62%). Notable per‑site gains include CD (60.47%), Coursera (30.00%), and HuggingFace (32.43%).

Grounding performance was measured with ScreenSpot‑V2 and ScreenSpot‑Pro: Qwen2.5‑VL‑3B improved from 61.87 to 65.88 (V2) and from 13.3 to 18.0 (Pro); Qwen2.5‑VL‑7B improved from 84.83 to 86.16 (V2) and from 23.2 to 27.5 (Pro), confirming that the synthetic data also benefits visual grounding.

Scaling Curve

The authors trained models with 8, 256, 1,024, and 16,253 samples (maintaining a 2:8 ratio of grounding to navigation data). Success rates on WebVoyager rose from 3.92% to 27.42% as data scale increased, and on Online‑Mind2Web from 1.22% to 14.02%. Polynomial fitting suggests the upward trend would continue with larger synthetic datasets, highlighting the pipeline’s scalability.

Conclusion

AutoWebWorld provides a transition‑driven environment generation pipeline that produces verifiable, low‑cost, and scalable GUI interaction data, addressing the core bottleneck of lacking a stable source of high‑quality trajectories for GUI‑agent training.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
