SEAgent: A Self‑Evolving Computer Agent that Learns Software Use Autonomously
SEAgent is a self‑evolving framework that lets a GUI agent master unfamiliar software through autonomous exploration and experience learning. It combines a curriculum generator, a World State Model, and GRPO‑based reinforcement with adversarial imitation, and achieves state‑of‑the‑art performance on OSWorld.
Introduction
GUI agents traditionally depend on human‑annotated demonstrations, which hampers their ability to adapt to new software. SEAgent proposes a self‑evolving framework that lets an agent acquire unfamiliar software operations through autonomous exploration and experience learning, without any human supervision.
Exploration & Experience Learning Pipeline
The pipeline consists of five steps, sketched in code below: (1) a Curriculum Generator creates tasks at the edge of the agent's current abilities; (2) the agent executes the tasks, producing interaction trajectories; (3) a World State Model (WSM) analyzes the full trajectory, labeling each step as a success or failure and generating state‑change descriptions; (4) GRPO (Group Relative Policy Optimization, a reinforcement‑learning algorithm) reinforces successful actions while adversarial imitation learning penalizes erroneous ones; (5) based on task outcomes and the WSM descriptions, a Software Guidebook is updated to inform the next round of curriculum generation.
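A minimal sketch of this loop is shown below. All class and method names here (generator.propose, agent.rollout, wsm.judge, and so on) are hypothetical stand‑ins for the components just described, not the paper's actual interfaces.

```python
# Hypothetical sketch of the SEAgent self-evolution loop; names are
# illustrative stand-ins for the five pipeline components.
def self_evolve(agent, env, generator, wsm, guidebook, num_rounds=5):
    for _ in range(num_rounds):
        # (1) Propose tasks at the edge of the agent's current ability.
        tasks = generator.propose(guidebook)
        for task in tasks:
            # (2) Roll out the agent, collecting (screenshot, action) pairs.
            trajectory = agent.rollout(env, task)
            # (3) WSM labels each step as success/failure and describes
            #     the state change it caused.
            judgments = wsm.judge(task, trajectory)
            # (4) Reinforce good steps (GRPO) and suppress bad ones
            #     (adversarial imitation).
            agent.update_policy(trajectory, judgments)
            # (5) Fold outcomes and state descriptions back into the
            #     guidebook for the next curriculum round.
            guidebook.update(task, trajectory, judgments)
    return agent
```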
Curriculum Generator
The generator aims to produce tasks that match the agent's current capability while remaining diverse enough to explore the software environment thoroughly. The guidebook records histories of successful and failed tasks along with newly discovered functions, enabling the generator to propose feasible, varied, and appropriately difficult tasks.
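The paper describes the guidebook's contents but not a concrete schema; the following is an illustrative guess at one, with all field names assumed.

```python
from dataclasses import dataclass, field

# Hypothetical Software Guidebook schema: success/failure history plus
# newly discovered functions, serialized as context for the curriculum model.
@dataclass
class Guidebook:
    succeeded: list[str] = field(default_factory=list)       # tasks the agent completed
    failed: list[str] = field(default_factory=list)          # tasks it could not finish
    functions: dict[str, str] = field(default_factory=dict)  # UI function -> description

    def as_prompt(self) -> str:
        """Serialize recent history into context for the curriculum model,
        so proposed tasks stay feasible, diverse, and near the agent's
        current capability frontier."""
        lines = ["Mastered tasks:"]
        lines += [f"- {t}" for t in self.succeeded[-20:]]
        lines.append("Failed tasks:")
        lines += [f"- {t}" for t in self.failed[-20:]]
        lines.append("Known functions:")
        lines += [f"- {name}: {desc}" for name, desc in self.functions.items()]
        return "\n".join(lines)
```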
World State Model
Traditional agents evaluate trajectories only by their final results. SEAgent's WSM instead performs step‑by‑step reasoning over all screenshots, providing fine‑grained reward signals. Because open‑source vision‑language models (e.g., Qwen2.5‑VL) struggle with long‑sequence evaluation, the authors fine‑tuned a model on 860 high‑quality trajectories generated by GPT‑4o, achieving judgment quality close to commercial models.
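The sketch below illustrates step‑wise judging. The `vlm_judge` callable stands in for the fine‑tuned WSM (e.g., a Qwen2.5‑VL checkpoint); its prompt and output format here are assumptions, not the paper's specification.

```python
# Assumes screenshots has one more entry than actions (initial state included).
def judge_trajectory(vlm_judge, task, screenshots, actions):
    judgments = []
    for i, action in enumerate(actions):
        # The WSM compares the screenshots before and after each action
        # and reasons about what changed on screen.
        before, after = screenshots[i], screenshots[i + 1]
        result = vlm_judge(
            task=task,
            images=[before, after],
            question=("Describe the state change this action caused, then "
                      "label the step as success, failure, or redundant."),
        )
        judgments.append({
            "step": i,
            "action": action,
            "label": result["label"],               # per-step reward signal
            "state_change": result["description"],  # feeds the guidebook
        })
    return judgments
```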
GRPO and Adversarial Imitation
The WSM supplies per‑step judgments. For non‑redundant successful steps, the best action in the sampled group receives positive reinforcement via GRPO. For failed steps, where the correct action is unknown, adversarial imitation learning instead reduces the probability of repeating the error.
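A simplified sketch of how the two signals could combine into one objective is shown below. This is not the paper's exact loss: it omits GRPO's clipped importance ratio and KL regularization, and the `beta` weighting on the unlikelihood term is an assumption.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantage: normalize rewards within the sampled group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def seagent_loss(logprobs: torch.Tensor, rewards: torch.Tensor,
                 failed_mask: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    # Policy-gradient term: reinforce actions whose WSM-derived reward
    # beats the group average (the GRPO signal).
    adv = grpo_advantages(rewards).detach()
    policy_loss = -(adv * logprobs).mean()
    # Adversarial-imitation term: directly lower the log-probability of
    # actions the WSM marked as failures; no "correct" target is needed.
    unlikelihood = (failed_mask * logprobs).mean()
    return policy_loss + beta * unlikelihood
```

The asymmetry mirrors the text: successes are pulled toward the group's best behavior, while failures only get pushed down, since no ground‑truth correction exists for them.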
Specialist‑to‑Generalist Training Strategy
To avoid knowledge conflicts and catastrophic forgetting when training across multiple software applications, SEAgent adopts a three‑stage strategy: (1) specialist training on individual software environments, (2) knowledge distillation, and (3) generalist training. Experiments on the OSWorld benchmark show the generalist reaches a 34.5% overall success rate, outperforming both a directly trained generalist (30.6%) and the combined specialists (32.2%).
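A hedged sketch of the three‑stage recipe follows; the helper callables (`self_evolve`, `collect_successes`, `finetune`) and the detail of distilling via pooled successful trajectories are illustrative assumptions about the paper's description.

```python
def specialist_to_generalist(base_model, software_envs,
                             self_evolve, collect_successes, finetune):
    # Stage 1: self-evolve one specialist per software environment.
    specialists = {env.name: self_evolve(base_model, env)
                   for env in software_envs}

    # Stage 2: distill each specialist's knowledge by pooling its
    # successful trajectories, avoiding cross-application conflicts.
    distilled = []
    for name, specialist in specialists.items():
        distilled += collect_successes(specialist, name)

    # Stage 3: train a single generalist on the pooled data; per the
    # paper, this beats both a directly trained generalist and the
    # union of specialists on OSWorld.
    return finetune(base_model, distilled)
```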
Ablation and Limitations
Ablation studies confirm that a high‑quality WSM is essential; exploration‑based RL outperforms behavior cloning, and the adversarial imitation component yields a noticeable performance boost. The main limitation is that reward quality depends entirely on the WSM’s accuracy, which caps the agent’s ultimate performance.