Can Agents Truly Self‑Evolve? GDPevo Benchmark That No Agent Can Cheat

This article introduces GDPevo, presented as the first open‑source benchmark that quantifies self‑evolution in agents. It generates 120 real‑world enterprise tasks through rule‑hybrid question creation, scores them deterministically, and reports that self‑evolution improves test accuracy by roughly 17‑22%, with two of the evaluated agents also consuming fewer tokens.

Machine Learning Algorithms & Natural Language Processing

Why self‑evolution has become the hot track

Imagine a new employee who learns from hands‑on experience and eventually solves unseen problems on their own. Replace that employee with an AI and you have the idea of a self‑evolving agent. The key insight is that any capability that can be clearly evaluated and automated will quickly be pushed to its limits by AI, as has already happened in Go, coding, and math.

Problem: measuring self‑evolution

Most current agents are "one‑shot": they cannot transfer experience from one task to the next. To assess true self‑evolution, we must first be able to measure it. Real‑world enterprise tasks (invoice audit, exhibition logistics, compliance, credit approval) involve fragmented, context‑rich rules and lack dedicated benchmarks, which makes evaluation difficult. Moreover, an agent that has effectively trained on the test set can score well without learning anything, so the metric must be robust to that kind of cheating.

GDPevo: the first benchmark for self‑evolution with economic value

We built GDPevo, an automated benchmark pipeline that generates, validates, and publishes 120 real‑world tasks across CRM, ERP, and Finance. The tasks are organized into groups of 5 training samples and 5 test samples (12 groups of 10 account for the 120 tasks), and every sample comes with a rule‑based scoring script. The pipeline consists of a seed‑scenario pool (GDPval, SOP‑Bench, JobBench), a multi‑agent task factory, quality review, and release.

Challenge 1 – Fully automated question generation

We designed an end‑to‑end process in which humans define the workflow once, and AI then continuously creates new questions, grades them, and iterates (inspired by Loop Engineering). This approach limits data leakage, because new questions are generated faster than models can memorize answers, and it scales without a human bottleneck.

Challenge 2 – Rule hybridization to force genuine learning

To avoid the "train‑on‑test" trap, we split complex business logic into atomic "meta‑rules" and embed them across five training samples, each exposing only a subset. Then we recombine those rules into test samples, forcing agents to generalize rather than memorize. Agents without self‑evolution see scattered fragments; self‑evolving agents can infer the underlying patterns and apply them to new tasks.
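The split‑and‑recombine idea can be sketched in a few lines. This is only an illustration under stated assumptions, not the GDPevo implementation: the function `hybridize`, its parameters, and the rule names are hypothetical. Each training sample exposes a small subset of meta‑rules, every rule appears in at least one training sample, and each test sample uses a combination of rules that never appears together in training.

```python
import random
from itertools import combinations

def hybridize(meta_rules, n_train=5, rules_per_train=2,
              n_test=5, rules_per_test=3, seed=0):
    """Spread meta-rules over training samples, then recombine them for tests."""
    rng = random.Random(seed)
    rules = list(meta_rules)
    rng.shuffle(rules)

    # Round-robin assignment: every rule lands in at least one training sample.
    train = [set() for _ in range(n_train)]
    for i, rule in enumerate(rules):
        train[i % n_train].add(rule)

    # Pad each training sample up to rules_per_train with other rules.
    for subset in train:
        pool = [r for r in rules if r not in subset]
        rng.shuffle(pool)
        while len(subset) < rules_per_train and pool:
            subset.add(pool.pop())

    # Test samples: rule combinations that match no training sample exactly.
    seen = {frozenset(s) for s in train}
    candidates = [frozenset(c)
                  for c in combinations(sorted(rules), rules_per_test)
                  if frozenset(c) not in seen]
    if len(candidates) < n_test:
        raise ValueError("not enough unseen rule combinations for the test set")
    rng.shuffle(candidates)
    return [frozenset(s) for s in train], candidates[:n_test]

# Hypothetical meta-rules for an invoice-audit scenario.
RULES = ["vat_rate", "approval_threshold", "currency_rounding",
         "duplicate_invoice", "late_fee", "vendor_whitelist"]
train, test = hybridize(RULES)
for i, subset in enumerate(train, 1):
    print(f"train {i}: {sorted(subset)}")
for i, subset in enumerate(test, 1):
    print(f"test  {i}: {sorted(subset)}")
```

An agent that only memorizes individual training samples never sees any test combination; it has to carry rules from several training samples into one new task.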

Scoring methodology

GDPevo uses a deterministic rule‑based scorer instead of LLM‑as‑a‑Judge. Scores are composed of multiple rubrics, ensuring reproducibility and traceability. Each failure point is explicitly reported, turning the benchmark into a diagnostic tool for agents.
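A deterministic, multi‑rubric scorer of this kind might look like the following minimal sketch. The `Rubric` class, the weights, and the answer fields are assumptions made for illustration; the actual GDPevo scoring scripts may be structured differently. The point is that the same answer always yields the same score, and every failed rubric is named.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Rubric:
    name: str                      # label reported when the check fails
    weight: float                  # contribution to the total score
    check: Callable[[dict], bool]  # pure function of the agent's answer

def score(answer: dict, rubrics: list[Rubric]) -> tuple[float, list[str]]:
    """Return the weighted score in [0, 1] and the names of failed rubrics."""
    total = sum(r.weight for r in rubrics)
    earned = 0.0
    failures = []
    for r in rubrics:
        if r.check(answer):
            earned += r.weight
        else:
            failures.append(r.name)
    return earned / total, failures

# Hypothetical rubrics for an invoice-audit answer.
rubrics = [
    Rubric("total_matches", 2.0, lambda a: a.get("total") == 1180),
    Rubric("currency_is_usd", 1.0, lambda a: a.get("currency") == "USD"),
    Rubric("flagged_for_review", 1.0, lambda a: a.get("flagged") is True),
]

result, failed = score({"total": 1180, "currency": "EUR", "flagged": True}, rubrics)
print(result, failed)  # 0.75 ['currency_is_usd']
```

Because no model is involved in grading, two runs over the same output cannot disagree, and the failure list doubles as a diagnostic report.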

We also treat cost (total token consumption) and accuracy as equally important. Every run logs both metrics, which enables trade‑off analysis and iterative optimization.
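One way to use the two logged metrics together is to average them per setting and check whether one setting is better on both axes. The sketch below assumes a hypothetical `Run` record, and the numbers are made up purely to exercise the code; they are not GDPevo results.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass(frozen=True)
class Run:
    setting: str     # e.g. "base" or "evolved" (labels are illustrative)
    accuracy: float  # fraction of rubric weight earned, 0..1
    tokens: int      # total tokens consumed by the run

def summarize(runs):
    """Average accuracy and token usage per setting."""
    grouped = {}
    for r in runs:
        grouped.setdefault(r.setting, []).append(r)
    return {s: (mean(r.accuracy for r in rs), mean(r.tokens for r in rs))
            for s, rs in grouped.items()}

def dominates(a, b):
    """True if (accuracy, tokens) pair a is no worse than b on both and better on one."""
    return a[0] >= b[0] and a[1] <= b[1] and (a[0] > b[0] or a[1] < b[1])

# Made-up log entries, three runs per setting.
runs = [
    Run("base", 0.50, 1000), Run("base", 0.50, 1200), Run("base", 0.50, 1100),
    Run("evolved", 0.75, 900), Run("evolved", 0.75, 800), Run("evolved", 0.75, 1000),
]
summary = summarize(runs)
print(summary)
print(dominates(summary["evolved"], summary["base"]))
```

When neither setting dominates the other, the choice becomes a genuine trade‑off between accuracy and cost rather than a clear win.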

Ease of use

The evaluation requires no SDK; a natural‑language‑driven workspace (a Markdown folder) lets users describe the experiment in a single sentence, and the entire pipeline runs automatically, producing reports and charts without writing code.

Results

We evaluated three agents (Claude Code, Codex, and Panofy) in three settings (base, few‑shot, reflect) on 12 task groups (120 tasks), averaging three runs per task. Self‑evolution raised test‑set accuracy by roughly 17‑22%. Two of the agents, Claude Code and Codex, also reduced token usage, reaching higher accuracy at lower cost.

Notable single‑task gains include:

Operational financial modeling: Codex improved from 42.76% to 92.47% with fewer tokens.

Claude Code few‑shot reached 100% (baseline 51.76%).

Panofy reflect climbed to 92.47% (baseline 62.39%).
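The size of these single‑task gains depends on whether they are read as percentage points or as relative improvement. The short calculation below uses only the figures quoted above; the helper `gain` is a hypothetical name introduced for illustration.

```python
def gain(baseline: float, evolved: float) -> tuple[float, float]:
    """Return (percentage-point gain, relative gain in percent)."""
    points = evolved - baseline
    return points, points / baseline * 100

# (baseline %, evolved %) as reported in the list above.
reported = {
    "Codex, operational financial modeling": (42.76, 92.47),
    "Claude Code few-shot": (51.76, 100.00),
    "Panofy reflect": (62.39, 92.47),
}

for name, (baseline, evolved) in reported.items():
    points, relative = gain(baseline, evolved)
    print(f"{name}: +{points:.2f} points ({relative:.1f}% relative)")
```

The absolute gains are 49.71, 48.24, and 30.08 percentage points, which correspond to relative improvements of about 116%, 93%, and 48% over the respective baselines.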

Overall, the agents demonstrated genuine self‑evolution: they learned from training samples, abstracted rules, and transferred that knowledge to unseen tasks, aligning with prior work on continual learning and recursive self‑improvement.

Open‑source invitation

The full GDPevo pipeline, artifacts, and results are publicly released on GitHub (https://github.com/Prism-Shadow/GDPevo). Researchers are encouraged to bring their own agents or business scenarios to extend the benchmark. The goal is not to create a leaderboard but to provide a scalable foundation for self‑evolving agents, ultimately freeing humans from repetitive work.
