The Next Frontier for Large‑Scale LLM Agents: 17 Must‑Read Papers on Self‑Evolving Harnesses
This article surveys 17 recent core papers that explore how the system‑level harness surrounding large‑model agents can be automatically generated, evolved, and audited, covering topics such as system boundaries, failure‑driven improvement, memory and skill optimization, source‑level rewriting, scaling laws, aging, and safety.
System Boundary
The survey begins by defining the seven dimensions of an agent harness—execution, tools, context, lifecycle, observability, verification, and governance—showing how these external system layers, rather than model weights, crucially affect agent performance.
Automatic Generation
Several papers propose generating harness code automatically. One approach uses tree search and Thompson sampling to explore program space, letting the model refine actions based on environment feedback; experiments on TextArena (145 games) show that generated code can block illegal actions and enable smaller models to outperform larger ones.
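The sampling half of that approach can be illustrated with a minimal sketch. It covers only Thompson sampling over a flat set of candidate harness programs, not the tree search, and the candidate names, success rates, and the simulated rollout are all hypothetical stand-ins rather than anything taken from the papers.

```python
import random

def thompson_select(stats, rng):
    """Pick the candidate with the highest draw from Beta(wins + 1, losses + 1)."""
    best_name, best_draw = None, -1.0
    for name, (wins, losses) in stats.items():
        draw = rng.betavariate(wins + 1, losses + 1)
        if draw > best_draw:
            best_name, best_draw = name, draw
    return best_name

def run_search(true_rates, rounds=1000, seed=0):
    """Allocate rollouts across candidate harnesses; returns [wins, losses] per candidate."""
    rng = random.Random(seed)
    stats = {name: [0, 0] for name in true_rates}
    for _ in range(rounds):
        name = thompson_select(stats, rng)
        # Hypothetical stand-in for running the harness in the environment.
        success = rng.random() < true_rates[name]
        stats[name][0 if success else 1] += 1
    return stats

stats = run_search({"harness_a": 0.1, "harness_b": 0.5, "harness_c": 0.9})
for name, (wins, losses) in stats.items():
    print(name, "rollouts:", wins + losses, "wins:", wins)
```

Because each candidate is chosen in proportion to how plausible it is that it is the best one, most of the rollout budget ends up on the strongest harness while weaker ones are still occasionally re-tested.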
Improving from Failures
Other works focus on diagnosing failures and repairing harnesses. A framework extracts failure modes from execution traces, proposes minimal harness edits, and validates them via regression tests. On Terminal‑Bench‑2, this method raises held‑out pass rates from 40.5% to 61.9% with MiniMax M2.5, and similar gains are reported for other models.
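The accept-or-reject step of such a repair loop might look like the sketch below. It assumes the harness is a flat configuration dictionary and that a case either passes or fails; the keys `timeout_s` and `retries` and the toy `run_case` function are hypothetical, chosen only to make the example runnable.

```python
def run_case(harness, case):
    """Toy stand-in for executing one task under a harness configuration."""
    has_time = harness["timeout_s"] >= case["needs_timeout_s"]
    survives_flakiness = harness["retries"] > 0 or not case["flaky"]
    return has_time and survives_flakiness

def try_edit(harness, edit, failing_cases, regression_cases):
    """Keep an edit only if it fixes the failures and breaks nothing that passed."""
    candidate = {**harness, **edit}
    fixes_failure = all(run_case(candidate, c) for c in failing_cases)
    no_regression = all(run_case(candidate, c) for c in regression_cases)
    return candidate if (fixes_failure and no_regression) else harness

harness = {"timeout_s": 30, "retries": 1}
failing = [{"needs_timeout_s": 60, "flaky": False}]      # extracted from a failed trace
regression = [{"needs_timeout_s": 10, "flaky": True}]    # previously passing case

# A minimal edit that only raises the timeout is accepted.
print(try_edit(harness, {"timeout_s": 90}, failing, regression))
# An edit that also disables retries fixes the failure but regresses, so it is rejected.
print(try_edit(harness, {"timeout_s": 90, "retries": 0}, failing, regression))
```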
Memory Harnesses
Memory‑centric papers argue that each task needs a dedicated memory structure. They encode memory as Python programs (data structures, storage logic, and commands) and use reflective code evolution and population search to discover optimal memory harnesses. Experiments on LoCoMo, ALFWorld, HealthBench, and PRBench show that task‑specific memory programs improve performance in 7 out of 8 configurations.
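To make "memory as a Python program" concrete, here is a deliberately small sketch with the three parts named above: a data structure, storage logic, and commands. The class name `KeywordMemory` and its word-overlap retrieval are assumptions for illustration; the papers search over much richer programs.

```python
class KeywordMemory:
    """Hypothetical memory program: an append-only note list with keyword search."""

    def __init__(self):
        self.entries = []              # data structure: ordered list of notes

    def write(self, text):
        self.entries.append(text)      # storage logic: append only, no compression

    def search(self, query, k=2):
        """Command: return up to k notes sharing the most words with the query."""
        query_words = set(query.lower().split())
        scored = [
            (len(query_words & set(entry.lower().split())), index, entry)
            for index, entry in enumerate(self.entries)
        ]
        hits = [item for item in scored if item[0] > 0]
        hits.sort(key=lambda item: (-item[0], item[1]))  # best overlap, then oldest
        return [entry for _, _, entry in hits[:k]]

memory = KeywordMemory()
memory.write("the key is under the mat")
memory.write("the door code is 4421")
memory.write("lunch at noon")
print(memory.search("where is the key", k=1))
```

A code-evolution search would mutate exactly these pieces, for example swapping the list for a graph or changing the ranking rule, and keep the variants that score best on the target task.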
Skill Optimization
Skill‑optimization papers treat agent skills as external state files while the model itself stays frozen. An optimizer generates textual edits (add, delete, replace) to improve skill files, accepting changes only when validation scores strictly increase. Across six benchmarks, seven target models, and three harness types, the method achieves best or tied‑best results in 52 out of 56 combinations, with notable score gains on GPT‑5.5.
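The strict-improvement rule can be sketched as follows. The edit format (dictionaries with `op`, `index`, `text`) and the keyword-counting `score` function are hypothetical; in the papers the score would come from running the agent on a validation set.

```python
def apply_edit(lines, edit):
    """Apply one add / delete / replace edit to a skill file held as a list of lines."""
    out = list(lines)
    if edit["op"] == "add":
        out.insert(edit["index"], edit["text"])
    elif edit["op"] == "delete":
        del out[edit["index"]]
    elif edit["op"] == "replace":
        out[edit["index"]] = edit["text"]
    else:
        raise ValueError(f"unknown op: {edit['op']}")
    return out

def optimize(lines, edits, score):
    """Greedy pass: keep an edit only if the validation score strictly increases."""
    best, best_score = list(lines), score(lines)
    for edit in edits:
        candidate = apply_edit(best, edit)
        candidate_score = score(candidate)
        if candidate_score > best_score:      # ties are rejected on purpose
            best, best_score = candidate, candidate_score
    return best, best_score

REQUIRED = ("search", "validate")             # toy validation target

def score(lines):
    text = " ".join(lines).lower()
    return sum(1 for word in REQUIRED if word in text)

skill = ["Use the search tool."]
edits = [
    {"op": "add", "index": 1, "text": "Validate input before calling tools."},
    {"op": "replace", "index": 0, "text": "Use the search tool carefully."},
    {"op": "delete", "index": 0},
]
print(optimize(skill, edits, score))
```

Rejecting ties keeps the skill file from drifting through edits that change wording without measurable benefit.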
Runtime Interface Adaptation
Some works avoid changing model weights or environments, instead adapting the runtime interface. By extracting recurring interaction failures from training traces and converting them into reusable interventions (e.g., environment constraints, process skills, action implementations), they improve success rates on deterministic benchmarks, achieving an average 88.5% relative improvement across 116 model‑environment pairs.
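A relative improvement is measured against the starting success rate, so it reads very differently from an absolute gain in percentage points. The helper below is a small arithmetic sketch; the 0.40 and 0.754 rates are made-up illustrative values, while 40.5 and 61.9 are the Terminal‑Bench‑2 pass rates quoted earlier in this article.

```python
def relative_improvement(before, after):
    """Fractional gain relative to the starting value: (after - before) / before."""
    if before <= 0:
        raise ValueError("relative improvement is undefined for a non-positive baseline")
    return (after - before) / before

# Made-up rates: going from 40% to 75.4% success is an 88.5% relative improvement,
# even though the absolute gain is 35.4 percentage points.
print(round(relative_improvement(0.40, 0.754), 3))

# The 40.5% -> 61.9% pass-rate gain quoted earlier is about 52.8% in relative terms.
print(round(relative_improvement(40.5, 61.9), 3))
```

An average over many model-environment pairs can also be dominated by pairs with very low baselines, which is worth keeping in mind when reading a single headline figure.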
Source‑Level Rewriting (MOSS)
Source‑level self‑evolution systems detect failure evidence, invoke external code agents to modify the agent system’s source code, and then apply batch replay, trial runs, user approval, hot container replacement, and rollback. On OpenClaw, a single evolution round raises average evaluator scores from 0.25 to 0.61.
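The gated rollout described above can be sketched as a short control loop. Everything here is a hypothetical stand-in: versions are plain strings, each gate is a callable returning True or False, and `deploy` merely records what would be swapped in.

```python
def evolve_once(current, candidate, gates, deploy, healthy):
    """Run a candidate through ordered gates, deploy it, and roll back if unhealthy."""
    for name, gate in gates:
        if not gate(candidate):
            return current, f"rejected at {name}"
    deploy(candidate)                  # stand-in for hot container replacement
    if not healthy(candidate):
        deploy(current)                # rollback to the previous version
        return current, "rolled back"
    return candidate, "promoted"

deployed = []
gates = [
    ("batch replay", lambda version: True),
    ("trial run", lambda version: True),
    ("user approval", lambda version: version != "v2-unapproved"),
]

print(evolve_once("v1", "v2", gates, deployed.append, lambda version: True))
print(evolve_once("v1", "v2-unapproved", gates, deployed.append, lambda version: True))
print(evolve_once("v1", "v3", gates, deployed.append, lambda version: False))
print(deployed)
```

Ordering the cheap automated gates before user approval means a person is only asked to review candidates that already survived replay and trial runs.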
Scaling Laws for Harnesses
A scaling‑law study introduces the notion of Effective Feedback Compute (EFC), counting only feedback that is valid, legal, non‑redundant, and retained for later decisions. Experiments show that EFC predicts failure rates far better than raw token counts or tool‑call numbers, and that improving feedback quality can raise success probability from 0.27 to 0.90 under fixed budgets.
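As a rough illustration of the four criteria, the sketch below simply counts feedback events that pass all of them. This is an assumption-laden simplification: the study's actual EFC definition may weight or measure feedback differently, and the `Feedback` fields are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Feedback:
    content: str     # what the environment or verifier reported
    valid: bool      # the signal is trustworthy (e.g. not from a crashed tool)
    legal: bool      # it came from a permitted action
    retained: bool   # it is still in context when later decisions are made

def effective_feedback_count(events):
    """Count events that are valid, legal, retained, and not a repeat of earlier ones."""
    seen = set()
    count = 0
    for event in events:
        if not (event.valid and event.legal and event.retained):
            continue
        if event.content in seen:      # redundant: already counted once
            continue
        seen.add(event.content)
        count += 1
    return count

events = [
    Feedback("test_a failed: KeyError", True, True, True),
    Feedback("test_a failed: KeyError", True, True, True),    # redundant repeat
    Feedback("permission denied", True, False, True),         # illegal action
    Feedback("test_b passed", True, True, False),             # dropped from context
    Feedback("lint: unused import", True, True, True),
]
print(len(events), effective_feedback_count(events))
```

Here five raw feedback events shrink to two effective ones, which is the kind of gap between raw counts and useful signal that the study argues raw token or tool-call budgets fail to capture.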
Agent Aging
Agent Lifespan Engineering (ALE) studies long‑term degradation caused by compression, interference, revision, and maintenance of memory. Using time‑dependent graphs and counterfactual probes, the work evaluates 7 scenarios, 14 models, and multiple memory strategies, extending conversation lengths to 200 rounds to reveal hidden accuracy decay even when behavior tests appear normal.
Safety Auditing of Harnesses
HarnessAudit proposes a full‑trace audit that checks boundary compliance, execution fidelity, and system stability. Built on a benchmark of 210 tasks across 8 real domains, the audit reveals that task‑completion rates and safety do not always align: violations accumulate as traces grow longer, and multi‑agent collaboration expands the risk surface.
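Of the three checks, boundary compliance is the easiest to sketch. The example below assumes a trace is a list of tool-call records and that the boundary is a simple allow-list; both the record format and the tool names are hypothetical and are not HarnessAudit's actual schema.

```python
def audit_trace(trace, allowed_tools):
    """Scan every step of a trace and report calls to tools outside the allow-list."""
    violations = [step for step in trace if step["tool"] not in allowed_tools]
    return {
        "steps": len(trace),
        "violations": len(violations),
        "violation_rate": len(violations) / len(trace) if trace else 0.0,
    }

trace = [
    {"tool": "read_file", "arg": "report.md"},
    {"tool": "shell", "arg": "curl example.com"},    # outside the boundary
    {"tool": "write_file", "arg": "report.md"},
    {"tool": "shell", "arg": "rm -rf tmp"},          # outside the boundary
]
task_completed = True   # the final answer may still be correct

report = audit_trace(trace, allowed_tools={"read_file", "write_file"})
print(task_completed, report)
```

The task in this toy trace completes even though half of its steps cross the boundary, which is why an audit that only looks at the final outcome would miss the problem.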
Overall Trend
Collectively, these works illustrate a shift: the harness is moving from a hidden engineering detail to a first‑class research object that can be defined, optimized, and evaluated. While model capability remains important, future agent assessments will increasingly consider the harness in which the model operates.
Machine Learning Algorithms & Natural Language Processing
Focused on frontier AI technologies, empowering AI researchers' progress.
