Boost 18 LLM Agents Without Retraining Using LIFE‑HARNESS
The article introduces LIFE‑HARNESS, a runtime‑interface adaptation framework that keeps model weights unchanged, extracts reusable failure patterns from a single model's training trace, and achieves an average 88.5% relative performance gain across 18 LLM agents and 7 deterministic environments, with successful transfer to 17 other models.
LIFE‑HARNESS: lifecycle‑aware runtime harness
LIFE‑HARNESS adapts the runtime interface between an LLM agent and its deterministic environment while keeping model weights and evaluation environments unchanged. It extracts reusable failure patterns from training trajectories and applies them as runtime interventions across four sequential layers.
1. Environment Contract Layer
Before interaction, this layer makes tool rules, call protocols, answer formats, environment constraints, and common pitfalls explicit, guiding the model to understand the exact expectations of the environment.
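One way to picture this layer is as a small structured record that is rendered into text and prepended to the prompt before the first turn. The sketch below is illustrative only: the EnvironmentContract class, its fields, and the sample DBBench rules are hypothetical and are not taken from the LIFE‑HARNESS code.

```python
from dataclasses import dataclass, field


@dataclass
class EnvironmentContract:
    """Hypothetical container for what an environment expects from the agent."""
    name: str
    tool_rules: list[str] = field(default_factory=list)
    answer_format: str = ""
    pitfalls: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the contract as plain text to prepend to the system prompt."""
        lines = [f"Environment: {self.name}"]
        if self.tool_rules:
            lines.append("Tool rules:")
            lines.extend(f"- {rule}" for rule in self.tool_rules)
        if self.answer_format:
            lines.append(f"Answer format: {self.answer_format}")
        if self.pitfalls:
            lines.append("Common pitfalls:")
            lines.extend(f"- {pitfall}" for pitfall in self.pitfalls)
        return "\n".join(lines)


# Sample values are invented for illustration, not real DBBench rules.
contract = EnvironmentContract(
    name="DBBench",
    tool_rules=["Issue exactly one SQL statement per turn."],
    answer_format="Final answer on a single line.",
    pitfalls=["Do not wrap SQL in Markdown fences."],
)
contract_text = contract.render()
print(contract_text)
```

Keeping the contract as data rather than free text makes it easy to reuse the same rendering across environments and models.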
2. Procedural Skill Layer
Reusable procedural skills are mined from training traces and retrieved for new tasks. Examples include stable operation flows in WebShop, database query patterns, and business‑process sequences that are captured without altering model parameters.
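The retrieval step can be sketched as a lookup over a small skill library. The article does not say how LIFE‑HARNESS retrieves skills, so the example below assumes a simple token-overlap (Jaccard) ranking; the retrieve_skills function, the SKILL_LIBRARY contents, and the skill format are all hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_skills(task: str, library: dict[str, str], top_k: int = 1) -> list[str]:
    """Rank stored skills by Jaccard overlap between task and skill description."""
    task_tokens = tokenize(task)
    scored = []
    for description, procedure in library.items():
        desc_tokens = tokenize(description)
        union = task_tokens | desc_tokens
        score = len(task_tokens & desc_tokens) / len(union) if union else 0.0
        scored.append((score, procedure))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop skills that share no tokens with the task.
    return [procedure for score, procedure in scored[:top_k] if score > 0]


# Hypothetical library: description -> procedure mined from training traces.
SKILL_LIBRARY = {
    "buy a product in a web shop": "search -> open item -> select options -> buy",
    "count rows in a database table": "inspect schema -> SELECT COUNT(*) -> answer",
}

hits = retrieve_skills("Buy a red shirt from the web shop", SKILL_LIBRARY)
print(hits)
```

The retrieved procedure would then be added to the prompt for the new task, which leaves the model parameters untouched.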
3. Action Realization Layer
After the model generates an action but before execution, this layer validates executability, correcting issues such as missing tool calls, malformed JSON, absent parameters, incorrect function names, or SQL syntax errors.
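A minimal sketch of such a pre-execution check is shown below, assuming actions arrive as JSON objects with "name" and "arguments" fields. The TOOL_SCHEMA table and validate_action function are hypothetical. This version only detects problems and returns feedback for the model; it does not attempt the automatic corrections the layer is described as performing, and it does not cover SQL syntax checks.

```python
import json
from typing import Optional

# Hypothetical tool schema: tool name -> required parameter names.
TOOL_SCHEMA = {"search": {"query"}, "click": {"element_id"}}


def validate_action(
    raw: str, schema: dict[str, set[str]]
) -> tuple[Optional[dict], Optional[str]]:
    """Return (action, None) if executable, else (None, feedback for the model)."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"Malformed JSON: {exc.msg}"
    if not isinstance(action, dict) or "name" not in action:
        return None, "Missing tool call: expected an object with a 'name' field."
    name = action["name"]
    if name not in schema:
        return None, f"Unknown function '{name}'. Valid names: {sorted(schema)}"
    args = action.get("arguments", {})
    if not isinstance(args, dict):
        return None, "'arguments' must be a JSON object."
    missing = schema[name] - set(args)
    if missing:
        return None, f"Missing parameters for '{name}': {sorted(missing)}"
    return action, None


action, error = validate_action(
    '{"name": "search", "arguments": {"query": "red shirt"}}', TOOL_SCHEMA
)
print(action, error)
```

When the check fails, the feedback string can be returned to the model in place of an environment error, so the invalid action is never executed and the model gets a chance to reissue it.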
4. Trajectory Regulation Layer
This layer monitors long trajectories for repetition, stagnation, or ineffective recovery. When the agent repeatedly searches, clicks, or performs invalid actions, especially as the step budget dwindles, the layer injects corrective prompts or recovery instructions.
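The monitoring logic can be approximated with a sliding window over recent actions. The sketch below is an assumption-laden illustration: the TrajectoryMonitor class, the window size, the repeat threshold, and the rule that tightens the threshold in the last quarter of the step budget are all hypothetical choices, not values from the paper.

```python
from collections import deque
from typing import Optional


class TrajectoryMonitor:
    """Flags repeated actions, more strictly once the step budget runs low."""

    def __init__(self, max_steps: int, window: int = 4, repeat_limit: int = 3):
        self.max_steps = max_steps
        self.repeat_limit = repeat_limit
        self.recent = deque(maxlen=window)  # sliding window of recent actions
        self.steps = 0

    def observe(self, action: str) -> Optional[str]:
        """Record one action; return a corrective prompt, or None if all is well."""
        self.steps += 1
        self.recent.append(action)
        remaining = self.max_steps - self.steps
        # Tighten the threshold when a quarter or less of the budget remains.
        low_budget = remaining <= self.max_steps // 4
        limit = max(2, self.repeat_limit - 1) if low_budget else self.repeat_limit
        repeats = self.recent.count(action)
        if repeats >= limit:
            return (
                f"Action '{action}' was repeated {repeats} times recently and "
                f"{remaining} steps remain. Try a different strategy."
            )
        return None


monitor = TrajectoryMonitor(max_steps=20)
actions = ["search[shirt]", "click[next]", "search[shirt]", "search[shirt]"]
prompts = [monitor.observe(a) for a in actions]
print(prompts[-1])
```

Comparing exact action strings is the simplest possible stagnation signal; a fuller version might also track whether observations change between steps.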
Experimental evaluation
The framework was tested on three benchmark suites (τ‑bench, τ²‑bench, AgentBench) covering seven deterministic environments (Airline, Retail, Telecom, ALFWorld, WebShop, OS, DBBench) and 18 diverse backbones, including instruction‑tuned, reasoning, and agent‑specialized models.
Performance improved in 116 of 126 model‑environment configurations (18 models × 7 environments), with an average relative gain of 88.5%.
The harness was evolved solely from the training trace of Qwen3‑4B‑Instruct and successfully transferred to the other 17 models.
Even agent‑specialized models that had undergone tool‑use training benefited, indicating that interface, action, and trajectory failures persist beyond model‑level training.
These results show that the learned patterns capture stable, environment‑side structures rather than model‑specific quirks.
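To make the headline numbers concrete, the snippet below works through the arithmetic. It assumes the usual definition of relative gain, (score with harness − baseline) / baseline; the article does not spell out the exact formula or how the average is taken, and the example scores are invented for illustration.

```python
def relative_gain(baseline: float, with_harness: float) -> float:
    """Relative improvement over the baseline score, as a percentage."""
    if baseline <= 0:
        raise ValueError("baseline must be positive to compute a relative gain")
    return (with_harness - baseline) / baseline * 100.0


# Configuration count reported in the article: 18 models x 7 environments.
num_models, num_envs = 18, 7
configs = num_models * num_envs
improved = 116
share_improved = improved / configs

# Invented scores: a success rate rising from 20% to 35% is a 75% relative gain.
example_gain = relative_gain(20.0, 35.0)

print(configs, round(share_improved * 100, 1), example_gain)
```

Under this definition, a relative gain can be large even when the absolute improvement is modest, because low baselines inflate the ratio.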
Relation to model training
Model training, instruction fine‑tuning, reinforcement learning, and distillation remain important for improving model parameters. LIFE‑HARNESS complements these approaches by modifying the runtime interface instead of the model weights.
Significance
Retraining large models is costly, slows iteration, and ties each improvement to a specific architecture. Many agent failures arise from mismatches at the model‑environment interface rather than from insufficient model capability. Adapting observation, tool use, action formatting, feedback handling, and trajectory control offers a cost‑effective alternative.
Conclusion
LIFE‑HARNESS demonstrates that LLM agents in deterministic environments can be substantially improved without updating model weights. Experiments confirm feasibility and effectiveness across seven environments and 18 models, with strong cross‑model transferability.
Paper: https://arxiv.org/abs/2605.22166
Code: https://github.com/Tianshi-Xu/Life-Harness
Source: PaperWeekly
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
