Beyond Single-Task Experts: Introducing EEVEE, the First Fully-Evolving Agent Framework
EEVEE is a test‑time prompt‑learning framework that lets LLM agents keep improving across diverse tasks by co‑evolving a router and a set of specialized prompts. As tasks are added sequentially, it accumulates a gain of +42 points while keeping token usage low and preserving single‑task performance.
Motivation
AI agents can now write code, invoke tools, and reflect on failures, but most improvements are measured on static benchmarks. Real deployments expose agents to a continuously changing mix of coding, math, knowledge QA, and formula tasks, which raises the question of whether a single deployed agent can keep improving.
EEVEE Framework
EEVEE (from Shanghai Jiao Tong University and Princeton) is a test‑time prompt‑learning framework that moves prompt optimization from a single‑task setting to a multi‑task one. Instead of one monolithic prompt, the system maintains multiple specialized prompts. For each input, a router selects the most suitable prompt, and the model generates its answer with that prompt. The router and prompts are co‑evolved: the router is first optimized to partition tasks, then each prompt is refined, and the updated prompts are fed back into the next router iteration.
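The route-then-answer step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the route labels, the keyword-based router, the prompt texts, and the `echo_llm` placeholder are all hypothetical stand-ins for the learned router, the learned prompts, and a real model call.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class PromptPool:
    """Specialized prompts keyed by a route label (labels are hypothetical)."""
    prompts: Dict[str, str]


def keyword_router(task: str) -> str:
    """Toy stand-in for a learned router: choose a label from surface cues."""
    text = task.lower()
    if "def " in text or "function" in text:
        return "code"
    if any(ch.isdigit() for ch in text):
        return "formula"
    return "qa"


def echo_llm(full_prompt: str) -> str:
    """Placeholder model: returns the first line of its input."""
    return full_prompt.splitlines()[0]


def answer(task: str, pool: PromptPool,
           router: Callable[[str], str],
           llm: Callable[[str], str]) -> str:
    """Route the task, prepend the selected prompt, and call the model."""
    route = router(task)
    prompt = pool.prompts[route]
    return llm(f"{prompt}\n\n{task}")


pool = PromptPool(prompts={
    "code": "Preserve the requested function interface.",
    "formula": "State the formula and units before computing.",
    "qa": "Answer concisely and name the relevant fact.",
})

print(answer("Write a function that reverses a list", pool, keyword_router, echo_llm))
# prints "Preserve the requested function interface."
```

Because only one specialized prompt is attached to each input, the context sent to the model stays short even as the pool of prompts grows.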
Router‑Prompt Co‑evolution
The router decides which prompt sees each sample, and the capabilities of the prompts in turn determine which routing decisions make sense. Because each side depends on the other, EEVEE alternates between optimizing the router (re‑partitioning tasks) and optimizing each prompt, feeding the improved prompts back into the router for the next round. This loop continues until routing becomes clearer and prompts become more specialized.
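The alternation can be sketched as a simple loop. The sketch below rests on stated assumptions: `fit_router` and `refine_prompt` are hypothetical callbacks standing in for whatever optimizers the framework actually uses, and a fixed number of rounds replaces the stopping condition, which the article does not spell out.

```python
from typing import Callable, Dict, List, Tuple

Sample = Tuple[str, str]          # (task text, reference answer)
Router = Callable[[str], str]     # maps task text to a prompt name


def co_evolve(
    samples: List[Sample],
    prompts: Dict[str, str],
    fit_router: Callable[[List[Sample], Dict[str, str]], Router],
    refine_prompt: Callable[[str, List[Sample]], str],
    rounds: int = 3,
) -> Tuple[Router, Dict[str, str]]:
    """Alternate router updates and per-prompt updates (fixed rounds: an assumption)."""
    for _ in range(rounds):
        # Step 1: optimize the router against the current prompts.
        router = fit_router(samples, prompts)
        # Step 2: partition the samples by the prompt each one is routed to.
        buckets: Dict[str, List[Sample]] = {name: [] for name in prompts}
        for sample in samples:
            buckets[router(sample[0])].append(sample)
        # Step 3: refine each prompt using only the samples it received.
        prompts = {name: refine_prompt(text, buckets[name])
                   for name, text in prompts.items()}
    # Final router fit so the returned router has seen the latest prompts.
    return fit_router(samples, prompts), prompts


def toy_fit_router(samples: List[Sample], prompts: Dict[str, str]) -> Router:
    """Hypothetical router: tasks containing a digit go to 'math', others to 'qa'."""
    return lambda task: "math" if any(c.isdigit() for c in task) else "qa"


def toy_refine(prompt: str, routed: List[Sample]) -> str:
    """Hypothetical refinement: record how many samples informed the update."""
    return f"{prompt} [refined on {len(routed)} samples]"


data = [("2+2", "4"), ("Capital of France?", "Paris"), ("3*5", "15")]
start = {"math": "Solve step by step.", "qa": "Answer briefly."}
final_router, final_prompts = co_evolve(data, start, toy_fit_router, toy_refine, rounds=1)
print(final_prompts["math"])
```

In a real system the two callbacks would consume task feedback such as ground-truth or rule-based scores; here they are deliberately trivial so that the control flow is the only thing on display.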
Experimental Setup
Four representative task families—knowledge QA, formula calculation, symbolic/math reasoning, and code generation—were combined into a mixed workload to simulate realistic agent usage.
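One simple way to picture such a mixed workload is a single shuffled stream drawn from the four families. The sketch below is illustrative only: the task strings are placeholders rather than benchmark items, and uniform shuffling is an assumption, since the article does not describe how the tasks were interleaved.

```python
import random
from typing import Dict, List, Tuple


def build_mixed_stream(families: Dict[str, List[str]],
                       seed: int = 0) -> List[Tuple[str, str]]:
    """Flatten per-family task lists into one shuffled stream of (family, task) pairs."""
    stream = [(name, task) for name, tasks in families.items() for task in tasks]
    random.Random(seed).shuffle(stream)  # seeded for reproducibility
    return stream


# Placeholder tasks, one list per family named in the article.
families = {
    "knowledge_qa": ["qa-1", "qa-2"],
    "formula": ["formula-1", "formula-2"],
    "math_reasoning": ["math-1", "math-2"],
    "code_generation": ["code-1", "code-2"],
}

stream = build_mixed_stream(families)
print(len(stream), "tasks in the mixed stream")
```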
Results on Mixed Workload
On Qwen3‑4B‑Instruct, average score increased from 41.37 to 51.75 (≈25% relative gain).
On DeepSeek‑V3.2, average score increased from 39.75 to 64.07 (≈61% relative gain).
Against existing state‑of‑the‑art prompt‑learning methods, EEVEE's largest relative improvement was 48.2%.
When tasks were added sequentially, many strong baselines stopped gaining or turned negative, whereas EEVEE kept improving and reached a cumulative gain of +42 points after all tasks had been introduced.
Single‑task performance remained strong: 55.25 on the Formula task, 73.17 on HumanEval, and an improvement from 14.73 to 25.27 on TheoremQA.
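The relative gains quoted for the two backbones follow directly from the before and after averages. A short check, using only the scores reported above (the helper name `relative_gain` is ours):

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


qwen = relative_gain(41.37, 51.75)      # Qwen3-4B-Instruct
deepseek = relative_gain(39.75, 64.07)  # DeepSeek-V3.2

print(f"Qwen3-4B-Instruct: {qwen:.1f}%")    # prints "Qwen3-4B-Instruct: 25.1%"
print(f"DeepSeek-V3.2: {deepseek:.1f}%")    # prints "DeepSeek-V3.2: 61.2%"
```

Both values round to the approximate figures given in the results (about 25% and about 61%).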
Token Efficiency
Despite the routing step, average token consumption per test sample was 4.32 K, comparable to the efficient baseline GEPA (3.47 K) and far lower than ACE (2 K – 1.3 K – 0 K).
Analysis of Prompt Learning
Case analyses show that prompt learning excels at converting feedback into reusable procedural strategies (e.g., maintaining function interfaces in code tasks, applying correct formulas and units in formula tasks). For knowledge‑intensive QA, prompt learning can improve reasoning structure but cannot compensate for missing factual knowledge.
Key Takeaways
EEVEE demonstrates that test‑time prompt learning can support continual adaptation in heterogeneous task streams without sacrificing single‑task ability.
The router‑prompt co‑evolution avoids the pitfalls of a single ever‑growing prompt and keeps token usage modest.
The approach works across different backbone models, indicating broad applicability.
The current version relies on ground‑truth or rule‑based feedback, so it is not yet fully self‑supervised.
Resources
Paper: https://arxiv.org/abs/2606.11182
Project page: https://princeton-ai2-lab.github.io/EEVEE/
Code repository: https://github.com/Princeton-AI2-Lab/EEVEE
