Beyond Single-Task Experts: Introducing EEVEE, the First Fully-Evolving Agent Framework

EEVEE is a test-time prompt-learning framework that lets LLM agents keep improving across diverse tasks by co-evolving a router and a set of specialized prompts. It reaches a cumulative gain of +42 points once all tasks have been introduced, while keeping token usage low and preserving single-task performance.

Machine Heart

Motivation

AI agents can now write code, invoke tools, and reflect on failures, but most improvements are measured on static benchmarks. Real deployments expose agents to a continuously changing mix of coding, math, knowledge QA, and formula-calculation tasks, which raises the question of whether a single deployed agent can keep improving.

EEVEE Framework

EEVEE (from Shanghai Jiao Tong University and Princeton) is a test-time prompt-learning framework that moves prompt optimization from a single-task focus to a multi-task setting. The system maintains multiple specialized prompts instead of one monolithic prompt. For each input, a router selects the most suitable prompt, and the model generates an answer using that prompt. The router and prompts are co-evolved: the router is first optimized to partition tasks, then each prompt is refined, and the updated prompts are fed back into the next router iteration.
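To make the inference path concrete, here is a minimal sketch of routing one input to one of several specialized prompts. It is an illustration under stated assumptions, not EEVEE's implementation: the prompt texts, the `keyword_router` heuristic, and the `echo_llm` stand-in are all hypothetical, and in the described system both the router and the prompts are learned.

```python
from typing import Callable, Dict, Tuple

# Hypothetical specialist prompts; in the described system these are learned.
PROMPTS: Dict[str, str] = {
    "code": "Keep the requested function interface unchanged.",
    "math": "Work step by step and state the formula and units used.",
    "qa": "Answer concisely and separate facts from reasoning.",
}


def keyword_router(task: str) -> str:
    """Toy stand-in for a learned router: pick a specialist by surface cues."""
    text = task.lower()
    if "def " in text or "function" in text:
        return "code"
    if any(ch.isdigit() for ch in text):
        return "math"
    return "qa"


def answer(
    task: str,
    router: Callable[[str], str],
    llm: Callable[[str], str],
) -> Tuple[str, str]:
    """Route the task, prepend the selected prompt, and call the model once."""
    key = router(task)
    return key, llm(f"{PROMPTS[key]}\n\n{task}")


def echo_llm(full_prompt: str) -> str:
    """Stand-in for a model call: returns the first line of its input."""
    return full_prompt.splitlines()[0]


if __name__ == "__main__":
    for task in ("Write a function that reverses a list",
                 "What is 12 * 7?",
                 "Who wrote Hamlet?"):
        print(answer(task, keyword_router, echo_llm))
```

Only the selected prompt is sent to the model, so the per-call context does not grow with the number of specialists.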

Router‑Prompt Co‑evolution

The two components depend on each other: the router decides which prompt sees each sample, and what each prompt can handle determines which routing decisions are meaningful. EEVEE therefore alternates between optimizing the router (re-partitioning tasks) and optimizing each prompt, feeding the improved prompts back into the router for the next round. This loop continues until routing becomes clearer and prompts become more specialized.
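The alternating loop can be sketched with a deliberately small toy model. Everything here is an assumption made for illustration: a "prompt" is a set of strategy names with a fixed capacity, a sample is solved when its needed strategy is in the prompt it is routed to, and the stopping rule is "nothing changed". The actual optimizers in EEVEE are LLM-driven and are not reproduced here.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Sample = Tuple[str, str]  # (task family, strategy needed to solve it); toy format
CAPACITY = 2              # hypothetical limit on strategies per prompt

SAMPLES: List[Sample] = [
    ("code", "keep_interface"), ("code", "keep_interface"),
    ("formula", "check_units"), ("formula", "check_units"),
    ("math", "show_steps"), ("qa", "structure_reasoning"),
]


def solved(prompt: Set[str], samples: List[Sample]) -> int:
    """Number of samples whose needed strategy the prompt already contains."""
    return sum(need in prompt for _, need in samples)


def optimize_router(prompts: List[Set[str]], samples: List[Sample]) -> Dict[str, int]:
    """Router step: send each family to the prompt that solves most of it,
    breaking ties in favour of the prompt with more free capacity."""
    routing: Dict[str, int] = {}
    for fam in sorted({f for f, _ in samples}):
        fam_samples = [s for s in samples if s[0] == fam]
        routing[fam] = max(
            range(len(prompts)),
            key=lambda i: (solved(prompts[i], fam_samples), CAPACITY - len(prompts[i])),
        )
    return routing


def optimize_prompts(prompts: List[Set[str]], routing: Dict[str, int],
                     samples: List[Sample]) -> List[Set[str]]:
    """Prompt step: each prompt absorbs the most frequent failures routed to it."""
    new_prompts = [set(p) for p in prompts]
    for i, prompt in enumerate(new_prompts):
        failures = Counter(need for fam, need in samples
                           if routing[fam] == i and need not in prompt)
        for need, _ in failures.most_common(CAPACITY - len(prompt)):
            prompt.add(need)
    return new_prompts


def co_evolve(samples: List[Sample], n_prompts: int = 2, max_rounds: int = 10):
    """Alternate router and prompt updates until neither changes."""
    prompts: List[Set[str]] = [set() for _ in range(n_prompts)]
    routing: Dict[str, int] = {}
    for _ in range(max_rounds):
        new_routing = optimize_router(prompts, samples)
        new_prompts = optimize_prompts(prompts, new_routing, samples)
        if new_routing == routing and new_prompts == prompts:
            break
        routing, prompts = new_routing, new_prompts
    return routing, prompts


routing, prompts = co_evolve(SAMPLES)
print(routing, prompts)
```

In this toy run the first round sends every family to one prompt, which fills up; the next router step then moves the unsolved families to the other prompt, which specializes on them. That is the feedback between the two steps in miniature.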

Experimental Setup

Four representative task families—knowledge QA, formula calculation, symbolic/math reasoning, and code generation—were combined into a mixed workload to simulate realistic agent usage.
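One simple way to build such a mixed workload is to tag each sample with its family and shuffle everything into a single stream. The sketch below assumes this construction; the family keys, placeholder samples, and the `build_mixed_stream` helper are hypothetical, and the paper's exact mixing procedure may differ.

```python
import random
from typing import Dict, List, Tuple


def build_mixed_stream(pools: Dict[str, List[str]], seed: int = 0) -> List[Tuple[str, str]]:
    """Tag every sample with its family, then shuffle all families into one stream."""
    stream = [(family, sample) for family, samples in pools.items() for sample in samples]
    random.Random(seed).shuffle(stream)  # seeded so the stream is reproducible
    return stream


# Placeholder sample IDs standing in for real benchmark items.
POOLS = {
    "knowledge_qa": ["q1", "q2"],
    "formula": ["f1", "f2"],
    "math": ["m1", "m2"],
    "code": ["c1", "c2"],
}

stream = build_mixed_stream(POOLS, seed=0)
print(stream)
```

The family tag is kept only for per-family scoring; in this sketch the agent would see just the sample text and has to route it without the label.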

Results on Mixed Workload

On Qwen3‑4B‑Instruct, average score increased from 41.37 to 51.75 (≈25% relative gain).

On DeepSeek‑V3.2, average score increased from 39.75 to 64.07 (≈61% relative gain).

Compared with existing SOTA prompt‑learning methods, the highest relative improvement reached 48.2%.

When tasks were added sequentially, many strong baselines stopped gaining or turned negative, whereas EEVEE kept improving and reached a cumulative gain of +42 points after all tasks were introduced.

Single-task performance remained strong: 55.25 on the Formula task and 73.17 on HumanEval, while TheoremQA improved from 14.73 to 25.27.
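The relative gains quoted for the two backbones follow directly from the before and after averages. The short check below reproduces them; `relative_gain` is a helper name introduced here for illustration.

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


qwen = relative_gain(41.37, 51.75)      # Qwen3-4B-Instruct averages
deepseek = relative_gain(39.75, 64.07)  # DeepSeek-V3.2 averages

print(f"Qwen3-4B-Instruct: {qwen:.1f}%")
print(f"DeepSeek-V3.2: {deepseek:.1f}%")
```

This gives roughly 25.1% and 61.2%, matching the rounded figures of about 25% and 61%.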

Token Efficiency

Despite the extra routing step, EEVEE consumed an average of 4.32K tokens per test sample, comparable to the efficient baseline GEPA (3.47K) and far lower than ACE (2 K – 1.3 K – 0 K).

Analysis of Prompt Learning

Case analyses show that prompt learning excels at converting feedback into reusable procedural strategies (e.g., maintaining function interfaces in code tasks, applying correct formulas and units in formula tasks). For knowledge‑intensive QA, prompt learning can improve reasoning structure but cannot compensate for missing factual knowledge.

Key Takeaways

EEVEE demonstrates that test‑time prompt learning can support continual adaptation in heterogeneous task streams without sacrificing single‑task ability.

The router‑prompt co‑evolution avoids the pitfalls of a single ever‑growing prompt and keeps token usage modest.

The approach works across different backbone models, indicating broad applicability.

The current version relies on ground-truth or rule-based feedback, so it is not yet fully self-supervised.

Resources

Paper: https://arxiv.org/abs/2606.11182

Project page: https://princeton-ai2-lab.github.io/EEVEE/

Code repository: https://github.com/Princeton-AI2-Lab/EEVEE
