SEAgent: A Self‑Evolving Computer Agent that Learns Software Use Autonomously
SEAgent is a self‑evolving framework that lets a GUI agent master unfamiliar software through autonomous exploration and experience learning. It combines a curriculum generator, a World State Model, and GRPO‑based reinforcement with adversarial imitation, and achieves state‑of‑the‑art performance on OSWorld.
Introduction
GUI agents traditionally depend on human‑annotated demonstrations, which hampers their ability to adapt to new software. SEAgent proposes a self‑evolving framework that lets an agent acquire unfamiliar software operations through autonomous exploration and experience learning, without any human supervision.
Exploration & Experience Learning Pipeline
The pipeline consists of five steps, sketched in code below: (1) a Curriculum Generator creates tasks at the edge of the agent's current abilities; (2) the agent executes the tasks, producing interaction trajectories; (3) a World State Model (WSM) analyzes the full trajectory, labeling each step as a success or failure and generating state‑change descriptions; (4) GRPO (Group Relative Policy Optimization, a reinforcement‑learning algorithm) reinforces successful actions while adversarial imitation learning penalizes erroneous ones; (5) based on task outcomes and the WSM descriptions, a Software Guidebook is updated to inform the next round of curriculum generation.
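A minimal sketch of this loop is shown below. All class and method names here (generator.propose, agent.rollout, wsm.judge, and so on) are hypothetical stand‑ins for the components just described, not the paper's actual interfaces.

```python
# Hypothetical sketch of the SEAgent self-evolution loop; names are
# illustrative stand-ins for the five pipeline components.
def self_evolve(agent, env, generator, wsm, guidebook, num_rounds=5):
    for _ in range(num_rounds):
        # (1) Propose tasks at the edge of the agent's current ability.
        tasks = generator.propose(guidebook)
        for task in tasks:
            # (2) Roll out the agent, collecting (screenshot, action) pairs.
            trajectory = agent.rollout(env, task)
            # (3) WSM labels each step as success/failure and describes
            #     the state change it caused.
            judgments = wsm.judge(task, trajectory)
            # (4) Reinforce good steps (GRPO) and suppress bad ones
            #     (adversarial imitation).
            agent.update_policy(trajectory, judgments)
            # (5) Fold outcomes and state descriptions back into the
            #     guidebook for the next curriculum round.
            guidebook.update(task, trajectory, judgments)
    return agent
```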
Curriculum Generator
The generator aims to produce tasks that match the agent's current capability while remaining diverse enough to explore the software environment thoroughly. The guidebook records histories of successful and failed tasks along with newly discovered functions, enabling the generator to propose feasible, varied, and appropriately difficult tasks.
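The paper describes the guidebook's contents but not a concrete schema; the following is an illustrative guess at one, with all field names assumed.

```python
from dataclasses import dataclass, field

# Hypothetical Software Guidebook schema: success/failure history plus
# newly discovered functions, serialized as context for the curriculum model.
@dataclass
class Guidebook:
    succeeded: list[str] = field(default_factory=list)       # tasks the agent completed
    failed: list[str] = field(default_factory=list)          # tasks it could not finish
    functions: dict[str, str] = field(default_factory=dict)  # UI function -> description

    def as_prompt(self) -> str:
        """Serialize recent history into context for the curriculum model,
        so proposed tasks stay feasible, diverse, and near the agent's
        current capability frontier."""
        lines = ["Mastered tasks:"]
        lines += [f"- {t}" for t in self.succeeded[-20:]]
        lines.append("Failed tasks:")
        lines += [f"- {t}" for t in self.failed[-20:]]
        lines.append("Known functions:")
        lines += [f"- {name}: {desc}" for name, desc in self.functions.items()]
        return "\n".join(lines)
```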
World State Model
Traditional agents evaluate trajectories only by their final results. SEAgent's WSM instead performs step‑by‑step reasoning over all screenshots, providing fine‑grained reward signals. Because open‑source vision‑language models (e.g., Qwen2.5‑VL) struggle with long‑sequence evaluation, the authors fine‑tuned a model on 860 high‑quality trajectories generated by GPT‑4o, achieving judgment quality close to commercial models.
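The sketch below illustrates step‑wise judging. The `vlm_judge` callable stands in for the fine‑tuned WSM (e.g., a Qwen2.5‑VL checkpoint); its prompt and output format here are assumptions, not the paper's specification.

```python
# Assumes screenshots has one more entry than actions (initial state included).
def judge_trajectory(vlm_judge, task, screenshots, actions):
    judgments = []
    for i, action in enumerate(actions):
        # The WSM compares the screenshots before and after each action
        # and reasons about what changed on screen.
        before, after = screenshots[i], screenshots[i + 1]
        result = vlm_judge(
            task=task,
            images=[before, after],
            question=("Describe the state change this action caused, then "
                      "label the step as success, failure, or redundant."),
        )
        judgments.append({
            "step": i,
            "action": action,
            "label": result["label"],               # per-step reward signal
            "state_change": result["description"],  # feeds the guidebook
        })
    return judgments
```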
GRPO and Adversarial Imitation
The WSM supplies per‑step judgments. For non‑redundant successful steps, the best action in the sampled group receives positive reinforcement via GRPO. For failed steps, where the correct action is unknown, adversarial imitation learning instead reduces the probability of repeating the error.
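A simplified sketch of how the two signals could combine into one objective is shown below. This is not the paper's exact loss: it omits GRPO's clipped importance ratio and KL regularization, and the `beta` weighting on the unlikelihood term is an assumption.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantage: normalize rewards within the sampled group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def seagent_loss(logprobs: torch.Tensor, rewards: torch.Tensor,
                 failed_mask: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    # Policy-gradient term: reinforce actions whose WSM-derived reward
    # beats the group average (the GRPO signal).
    adv = grpo_advantages(rewards).detach()
    policy_loss = -(adv * logprobs).mean()
    # Adversarial-imitation term: directly lower the log-probability of
    # actions the WSM marked as failures; no "correct" target is needed.
    unlikelihood = (failed_mask * logprobs).mean()
    return policy_loss + beta * unlikelihood
```

The asymmetry mirrors the text: successes are pulled toward the group's best behavior, while failures only get pushed down, since no ground‑truth correction exists for them.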
Specialist‑to‑Generalist Training Strategy
To avoid knowledge conflicts and catastrophic forgetting when training across multiple software applications, SEAgent adopts a three‑stage strategy: (1) specialist training on individual software environments, (2) knowledge distillation, and (3) generalist training. Experiments on the OSWorld benchmark show the generalist reaches a 34.5% overall success rate, outperforming both a directly trained generalist (30.6%) and the combined specialists (32.2%).
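A hedged sketch of the three‑stage recipe follows; the helper callables (`self_evolve`, `collect_successes`, `finetune`) and the detail of distilling via pooled successful trajectories are illustrative assumptions about the paper's description.

```python
def specialist_to_generalist(base_model, software_envs,
                             self_evolve, collect_successes, finetune):
    # Stage 1: self-evolve one specialist per software environment.
    specialists = {env.name: self_evolve(base_model, env)
                   for env in software_envs}

    # Stage 2: distill each specialist's knowledge by pooling its
    # successful trajectories, avoiding cross-application conflicts.
    distilled = []
    for name, specialist in specialists.items():
        distilled += collect_successes(specialist, name)

    # Stage 3: train a single generalist on the pooled data; per the
    # paper, this beats both a directly trained generalist and the
    # union of specialists on OSWorld.
    return finetune(base_model, distilled)
```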
Ablation and Limitations
Ablation studies confirm that a high‑quality WSM is essential; exploration‑based RL outperforms behavior cloning, and the adversarial imitation component yields a noticeable performance boost. The main limitation is that reward quality depends entirely on the WSM’s accuracy, which caps the agent’s ultimate performance.