The Next Frontier for Large‑Scale LLM Agents: 17 Must‑Read Papers on Self‑Evolving Harnesses

This article surveys 17 recent core papers that explore how the system‑level harness surrounding large‑model agents can be automatically generated, evolved, and audited, covering topics such as system boundaries, failure‑driven improvement, memory and skill optimization, source‑level rewriting, scaling laws, aging, and safety.

Machine Learning Algorithms & Natural Language Processing

System Boundary

The survey begins by defining the seven dimensions of an agent harness (execution, tools, context, lifecycle, observability, verification, and governance) and shows how these external system layers, rather than model weights, materially affect agent performance.

System boundary diagram

Automatic Generation

Several papers propose generating harness code automatically. One approach uses tree search and Thompson sampling to explore program space, letting the model refine actions based on environment feedback; experiments on TextArena (145 games) show that generated code can block illegal actions and enable smaller models to outperform larger ones.
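
As a rough illustration of the Thompson-sampling half of that search, the sketch below treats each candidate harness program as a bandit arm and spends more trials on the ones that win more often. It is a minimal sketch, not the paper's algorithm: the tree search over program space is omitted, and the candidate names, win rates, and the `thompson_select` and `run_search` helpers are all hypothetical.

```python
import random

def thompson_select(stats, rng):
    """Pick the candidate whose sampled Beta(wins+1, losses+1) draw is highest."""
    draws = {name: rng.betavariate(w + 1, l + 1) for name, (w, l) in stats.items()}
    return max(draws, key=draws.get)

def run_search(true_rates, rounds, seed=0):
    """Simulate repeated trials; true_rates stands in for real environment feedback."""
    rng = random.Random(seed)
    stats = {name: (0, 0) for name in true_rates}  # (wins, losses) per candidate
    for _ in range(rounds):
        name = thompson_select(stats, rng)
        won = rng.random() < true_rates[name]  # simulated game outcome
        w, l = stats[name]
        stats[name] = (w + 1, l) if won else (w, l + 1)
    return stats

# Two hypothetical harness candidates with assumed win rates.
stats = run_search({"harness_a": 0.2, "harness_b": 0.8}, rounds=200)
pulls = {name: w + l for name, (w, l) in stats.items()}
print(pulls)
```

The stronger candidate should receive most of the 200 trials, which is the property that makes this kind of sampling useful when each evaluation is an expensive game rollout.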

Automatic harness generation

Improving from Failures

Other works focus on diagnosing failures and repairing harnesses. One framework extracts failure modes from execution traces, proposes minimal harness edits, and validates them via regression tests. On Terminal‑Bench‑2, this method raises the held‑out pass rate from 40.5% to 61.9% with MiniMax M2.5, a gain of 21.4 percentage points, and yields similar gains for other models.
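
The loop described above (extract a failure mode, propose a small edit, keep it only if regression tests pass) can be sketched as follows. This is a toy under stated assumptions: the harness is modelled as a plain dict, failure modes are pre-labelled strings, and `top_failure_mode`, `accept_edit`, and the config keys are hypothetical names, not the framework's API.

```python
from collections import Counter

def top_failure_mode(traces):
    """Return the most common failure label among failed traces, or None."""
    counts = Counter(t["failure"] for t in traces if t["failure"] is not None)
    return counts.most_common(1)[0][0] if counts else None

def accept_edit(harness, edit, regression_tests):
    """Apply the edit to a copy; keep it only if every regression test passes."""
    candidate = {**harness, **edit}
    if all(test(candidate) for test in regression_tests):
        return candidate
    return harness

traces = [
    {"task": "t1", "failure": "timeout"},
    {"task": "t2", "failure": None},
    {"task": "t3", "failure": "timeout"},
    {"task": "t4", "failure": "bad_path"},
]
harness = {"timeout_s": 30, "max_retries": 1}
# Hypothetical mapping from a diagnosed failure mode to a minimal edit.
edits_by_mode = {"timeout": {"timeout_s": 120}}
regression_tests = [
    lambda h: h["timeout_s"] <= 300,  # stay within an assumed budget
    lambda h: h["max_retries"] >= 1,  # keep existing retry behaviour
]

mode = top_failure_mode(traces)
repaired = accept_edit(harness, edits_by_mode[mode], regression_tests)
rejected = accept_edit(harness, {"timeout_s": 900}, regression_tests)
print(mode, repaired, rejected)
```

Because `accept_edit` never mutates the original dict, a rejected edit leaves the harness exactly as it was.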

Failure‑driven improvement loop

Memory Harnesses

Memory‑centric papers argue that each task needs a dedicated memory structure. They encode memory as Python programs (data structures, storage logic, and commands) and use reflective code evolution and population search to discover optimal memory harnesses. Experiments on LoCoMo, ALFWorld, HealthBench, and PRBench show that task‑specific memory programs improve performance in 7 out of 8 configurations.
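
To make "memory as a Python program" concrete, here is a minimal sketch with the three parts the paragraph names: a data structure, storage logic, and a command interface. The `KeyedMemory` class and its `WRITE`/`READ` command syntax are illustrative assumptions, not a structure taken from the papers, and the evolutionary search that would mutate such a program is not shown.

```python
from dataclasses import dataclass, field

@dataclass
class KeyedMemory:
    """Hypothetical task-specific memory: facts grouped by entity, newest last."""
    store: dict = field(default_factory=dict)

    def write(self, entity, fact):
        """Storage logic: append the fact to the entity's list."""
        self.store.setdefault(entity, []).append(fact)

    def read(self, entity, k=2):
        """Return the k most recent facts for the entity (empty list if unknown)."""
        return self.store.get(entity, [])[-k:]

    def execute(self, command):
        """Command interface: 'WRITE entity|fact' or 'READ entity'."""
        op, _, arg = command.partition(" ")
        if op == "WRITE":
            entity, _, fact = arg.partition("|")
            self.write(entity.strip(), fact.strip())
            return "ok"
        if op == "READ":
            return self.read(arg.strip())
        raise ValueError(f"unknown command: {op}")

mem = KeyedMemory()
mem.execute("WRITE alice|likes tea")
mem.execute("WRITE alice|moved to Oslo")
mem.execute("WRITE alice|adopted a cat")
recent = mem.execute("READ alice")
print(recent)
```

A different task might prefer a timeline or a graph instead of per-entity lists, and that choice is what a search over memory programs would explore.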

Memory harness examples

Skill Optimization

Skill‑optimization papers treat agent skills as frozen external state files. An optimizer generates textual edits (add, delete, replace) to improve skill files, accepting changes only when validation scores strictly increase. Across six benchmarks, seven target models, and three harness types, the method achieves best or tied‑best results in 52 out of 56 combinations, with notable score gains on GPT‑5.5.
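
The acceptance rule (keep an edit only when the validation score strictly increases) can be sketched as below. The edit tuple format, the `apply_edit` and `optimize` helpers, and the keyword-counting score are assumptions made for illustration; in the papers the score would come from running the agent on validation tasks.

```python
def apply_edit(lines, edit):
    """Apply one (op, index, text) edit to a copy of the skill file's lines."""
    op, index, text = edit
    new = list(lines)
    if op == "add":
        new.insert(index, text)
    elif op == "delete":
        del new[index]
    elif op == "replace":
        new[index] = text
    else:
        raise ValueError(f"unknown op: {op}")
    return new

def optimize(skill, edits, score):
    """Greedy pass: accept an edit only if the score strictly increases."""
    best, best_score = list(skill), score(skill)
    for edit in edits:
        candidate = apply_edit(best, edit)
        candidate_score = score(candidate)
        if candidate_score > best_score:  # ties are rejected
            best, best_score = candidate, candidate_score
    return best, best_score

# Toy validation score: how many required behaviours the skill text mentions.
required = ("verify", "retry")
score = lambda lines: sum(any(k in line for line in lines) for k in required)

skill = ["Run the command.", "Report the output."]
edits = [
    ("add", 1, "Always verify the exit code."),
    ("replace", 0, "Execute the command."),  # neutral rewording, rejected
    ("add", 2, "On failure, retry once."),
]
best, best_score = optimize(skill, edits, score)
print(best_score, best)
```

Rejecting ties keeps the skill file from drifting through rewordings that do not measurably help.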

Skill optimization pipeline

Runtime Interface Adaptation

Some works avoid changing model weights or environments, instead adapting the runtime interface. By extracting recurring interaction failures from training traces and converting them into reusable interventions (e.g., environment constraints, process skills, action implementations), they improve success rates on deterministic benchmarks, achieving an average 88.5% relative improvement across 116 model‑environment pairs.
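
One of the intervention types mentioned, an environment constraint, can be pictured as a thin wrapper that sits between the model and the environment and leaves both untouched. The `GuardedEnv` class and the counter environment below are hypothetical and only illustrate the shape of such an intervention.

```python
class GuardedEnv:
    """Wraps an environment's step function with a legality check."""

    def __init__(self, step_fn, legal_actions_fn):
        self.step_fn = step_fn
        self.legal_actions_fn = legal_actions_fn
        self.blocked = 0  # how many illegal actions were intercepted

    def step(self, state, action):
        """Forward legal actions; for illegal ones, keep the state and explain why."""
        legal = self.legal_actions_fn(state)
        if action not in legal:
            self.blocked += 1
            return state, f"illegal action {action!r}; legal: {sorted(legal)}"
        return self.step_fn(state, action), "ok"

# Toy environment: a counter that may not go below zero.
step_fn = lambda s, a: s + 1 if a == "inc" else s - 1
legal_actions_fn = lambda s: {"inc"} if s == 0 else {"inc", "dec"}

env = GuardedEnv(step_fn, legal_actions_fn)
first = env.step(0, "dec")   # blocked, state unchanged
second = env.step(0, "inc")  # allowed
print(first, second, env.blocked)
```

Returning a message that lists the legal actions gives the model something to act on in its next turn instead of silently wasting a step.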

Runtime interface adaptation

Source‑Level Rewriting (MOSS)

Source‑level self‑evolution systems detect failure evidence, invoke external code agents to modify the agent system’s source code, and then apply batch replay, trial runs, user approval, hot container replacement, and rollback. On OpenClaw, a single evolution round raises average evaluator scores from 0.25 to 0.61.
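
The gating around a source rewrite (batch replay, approval, then swap or rollback) can be sketched in a few lines. This is an assumption-laden toy: `evolve_once` and its callbacks are hypothetical, a "version" is a plain dict instead of real source code, and the trial run and container replacement steps are collapsed into the return value.

```python
def evolve_once(current, propose, evaluate, approve, replay_batch):
    """Return (version, score): the candidate if it wins replay and is approved,
    otherwise the current version with its baseline score (rollback)."""
    baseline = sum(evaluate(current, t) for t in replay_batch) / len(replay_batch)
    candidate = propose(current)
    score = sum(evaluate(candidate, t) for t in replay_batch) / len(replay_batch)
    if score > baseline and approve(candidate, baseline, score):
        return candidate, score  # swap in the new version
    return current, baseline     # keep the old version

# Toy setup: a task needing t retries succeeds only if the version allows them.
evaluate = lambda version, t: 1.0 if version["retries"] >= t else 0.0
propose = lambda version: {**version, "retries": version["retries"] + 1}
replay_batch = [0, 1, 2, 3]
current = {"retries": 1}

approved = evolve_once(current, propose, evaluate, lambda c, b, s: True, replay_batch)
declined = evolve_once(current, propose, evaluate, lambda c, b, s: False, replay_batch)
print(approved, declined)
```

Keeping the approval callback separate from the replay score mirrors the idea that a measurable improvement alone does not deploy a change.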

Source‑level evolution flow

Scaling Laws for Harnesses

A scaling‑law study introduces the notion of Effective Feedback Compute (EFC), counting only feedback that is valid, legal, non‑redundant, and retained for later decisions. Experiments show that EFC predicts failure rates far better than raw token counts or tool‑call numbers, and that improving feedback quality can raise success probability from 0.27 to 0.90 under fixed budgets.
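
The four filters in that definition can be read as a simple counting rule. The sketch below counts feedback events that pass all four; it is only an illustration of the filtering idea, since the summary does not give the study's exact formula or units, and the `Feedback` record and `effective_feedback_count` function are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Feedback:
    content: str    # what the environment or tool returned
    valid: bool     # the feedback was well-formed and usable
    legal: bool     # it resulted from a permitted action
    retained: bool  # it was still available for later decisions

def effective_feedback_count(events):
    """Count events that are valid, legal, retained, and not seen before."""
    seen = set()
    count = 0
    for event in events:
        if not (event.valid and event.legal and event.retained):
            continue
        if event.content in seen:  # redundant repeat adds nothing
            continue
        seen.add(event.content)
        count += 1
    return count

events = [
    Feedback("test_a failed: KeyError", True, True, True),
    Feedback("test_a failed: KeyError", True, True, True),   # redundant
    Feedback("<truncated output>", False, True, True),       # invalid
    Feedback("rm -rf denied", True, False, True),            # illegal action
    Feedback("test_b passed", True, True, False),            # dropped from context
    Feedback("test_c failed: timeout", True, True, True),
]
raw = len(events)
effective = effective_feedback_count(events)
print(raw, effective)
```

Here six raw feedback events reduce to two effective ones, which is the gap between raw activity and useful signal that the study argues matters.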

Effective feedback compute

Agent Aging

Agent Lifespan Engineering (ALE) studies long‑term degradation caused by compression, interference, revision, and maintenance of memory. Using time‑dependent graphs and counterfactual probes, the work evaluates 7 scenarios, 14 models, and multiple memory strategies, extending conversation lengths to 200 rounds to reveal hidden accuracy decay even when behavior tests appear normal.

Agent aging mechanisms

Safety Auditing of Harnesses

HarnessAudit proposes a full‑trace audit that checks boundary compliance, execution fidelity, and system stability. Built on a benchmark of 210 tasks across 8 real domains, the audit reveals that task‑completion rates and safety do not always align; violations accumulate with longer traces and multi‑agent collaboration expands the risk surface.
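
The gap between completion and safety is easy to see in a toy full-trace check. The sketch below covers only boundary compliance, one of the three audit dimensions, and the trace format, tool names, and `audit_trace` function are assumptions for illustration, not HarnessAudit's actual interface.

```python
def audit_trace(trace, allowed_tools):
    """Return (task_completed, violations) by inspecting every step of the trace,
    not just the final outcome."""
    violations = [
        (i, step["tool"])
        for i, step in enumerate(trace)
        if step["tool"] not in allowed_tools
    ]
    completed = bool(trace) and trace[-1].get("done", False)
    return completed, violations

allowed_tools = {"read_file", "write_file"}
trace = [
    {"tool": "read_file"},
    {"tool": "shell"},                    # outside the permitted boundary
    {"tool": "write_file", "done": True},
]
completed, violations = audit_trace(trace, allowed_tools)
print(completed, violations)
```

The task finishes successfully while the trace still contains a boundary violation, which an outcome-only metric would miss.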

Safety audit pipeline

Overall Trend

Collectively, these works illustrate a shift: the harness is moving from a hidden engineering detail to a first‑class research object that can be defined, optimized, and evaluated. While model capability remains important, future agent assessments will increasingly consider the harness in which the model operates.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: scaling laws, Agent Memory, LLM Agents, Self‑Evolution, Harness Engineering, Skill Optimization, Safety Audit
Written by

Machine Learning Algorithms & Natural Language Processing

Focused on frontier AI technologies, empowering AI researchers' progress.
