Boost 18 LLM Agents Without Retraining Using LIFE‑HARNESS

The article introduces LIFE‑HARNESS, a runtime‑interface adaptation framework that keeps model weights unchanged, extracts reusable failure patterns from a single model's training trace, and achieves an average 88.5% relative performance gain across 18 LLM agents and 7 deterministic environments, with successful transfer to 17 other models.

Data Party THU

LIFE‑HARNESS: lifecycle‑aware runtime harness

LIFE‑HARNESS adapts the runtime interface between an LLM agent and its deterministic environment while keeping both the model weights and the evaluation environment unchanged. It extracts reusable failure patterns from training trajectories and injects them as runtime interventions at four sequential layers.

1. Environment Contract Layer

Before interaction, this layer makes tool rules, call protocols, answer formats, environment constraints, and common pitfalls explicit, guiding the model to understand the exact expectations of the environment.
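As a rough illustration of what making a contract explicit might look like, the sketch below renders the listed items into text that can be prepended to the agent's prompt. The `EnvironmentContract` class, its field names, and the example rules are assumptions made for this example; they are not taken from the LIFE‑HARNESS code.

```python
from dataclasses import dataclass, field


@dataclass
class EnvironmentContract:
    """Hypothetical container for the items the contract layer makes explicit."""
    tool_rules: list[str] = field(default_factory=list)
    answer_format: str = ""
    constraints: list[str] = field(default_factory=list)
    pitfalls: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Turn the contract into plain text to prepend to the agent's prompt."""
        sections = [
            ("Tool rules", self.tool_rules),
            ("Answer format", [self.answer_format] if self.answer_format else []),
            ("Constraints", self.constraints),
            ("Common pitfalls", self.pitfalls),
        ]
        lines = []
        for title, items in sections:
            if not items:
                continue  # skip empty sections so the prompt stays short
            lines.append(f"{title}:")
            lines.extend(f"- {item}" for item in items)
        return "\n".join(lines)


# Made-up rules, for illustration only.
contract = EnvironmentContract(
    tool_rules=["Call exactly one tool per turn."],
    answer_format="Reply with a single JSON object.",
    pitfalls=["Do not invent product IDs."],
)
contract_text = contract.render()
```

Because the contract is plain text built before the first step, it can be attached to any backbone without touching its weights.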

2. Procedural Skill Layer

Reusable procedural skills are mined from training traces and retrieved for new tasks. Examples include stable operation flows in WebShop, database query patterns, and business‑process sequences that are captured without altering model parameters.
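The article does not say how skills are matched to new tasks, so the following is only a minimal sketch that assumes a simple token-overlap (Jaccard) score as a stand-in for the retrieval step. The `Skill` structure, the skill names, and the example steps are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    trigger: str        # short description of when the skill applies
    steps: list[str]    # the procedure to show the agent


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_skills(task: str, library: list[Skill], top_k: int = 1) -> list[Skill]:
    """Rank skills by Jaccard overlap between task tokens and trigger tokens."""
    task_tokens = tokenize(task)
    scored = []
    for skill in library:
        trigger_tokens = tokenize(skill.trigger)
        union = task_tokens | trigger_tokens
        score = len(task_tokens & trigger_tokens) / len(union) if union else 0.0
        if score > 0:
            scored.append((score, skill))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [skill for _, skill in scored[:top_k]]


# Hypothetical skill library; in the framework these would be mined from traces.
library = [
    Skill("shop_search", "buy a product in the web shop",
          ["search for keywords", "open the best match", "select options", "click buy"]),
    Skill("db_lookup", "answer a question with a database query",
          ["inspect the table schema", "write a SELECT query", "report the result"]),
]
selected = retrieve_skills("Buy a red product under 20 dollars in the shop", library)
```

The retrieved steps would then be added to the prompt for the new task, which is how a procedure can be reused without any parameter update.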

3. Action Realization Layer

After the model generates an action but before execution, this layer validates executability, correcting issues such as missing tool calls, malformed JSON, absent parameters, incorrect function names, or SQL syntax errors.
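A minimal sketch of such a pre-execution check is shown below, assuming actions arrive as JSON objects with `name` and `arguments` fields. The tool registry, the field names, and the fuzzy-match cutoff are assumptions for illustration; SQL checking is left out.

```python
import difflib
import json
from typing import Optional

# Hypothetical tool registry: tool name -> required parameter names.
TOOLS = {
    "search": {"query"},
    "click": {"element_id"},
}


def realize_action(raw: str) -> tuple[Optional[dict], str]:
    """Return (action, message); action is None when the model must retry."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"Malformed JSON: {exc.msg}"
    if not isinstance(action, dict) or "name" not in action:
        return None, "Missing tool call: expected an object with a 'name' field."

    name = action["name"]
    if name not in TOOLS:
        # Repair near-miss function names such as 'serach' -> 'search'.
        matches = difflib.get_close_matches(str(name), list(TOOLS), n=1, cutoff=0.7)
        if not matches:
            return None, f"Unknown tool '{name}'. Available: {sorted(TOOLS)}"
        name = matches[0]

    arguments = action.get("arguments", {})
    if not isinstance(arguments, dict):
        return None, "The 'arguments' field must be a JSON object."
    missing = TOOLS[name] - set(arguments)
    if missing:
        return None, f"Tool '{name}' is missing parameters: {sorted(missing)}"
    return {"name": name, "arguments": arguments}, "ok"


fixed, status = realize_action('{"name": "serach", "arguments": {"query": "red mug"}}')
```

Here a misspelled tool name is repaired silently, while unrecoverable problems are returned as a message the agent can act on. Whether to repair or to ask for a retry is a design choice this sketch makes on its own.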

4. Trajectory Regulation Layer

This layer monitors long trajectories for repetition, stagnation, and ineffective recovery. When the agent keeps repeating the same searches, clicks, or invalid actions, especially as the step budget runs low, the layer issues corrective prompts or recovery instructions.
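One way to picture this monitoring is a small sliding-window check over recent actions combined with a step-budget warning. The class below is a sketch under that assumption; the thresholds, the class name, and the wording of the prompts are illustrative and not taken from the paper.

```python
from collections import Counter, deque
from typing import Optional


class TrajectoryRegulator:
    """Flags repeated actions and a shrinking step budget (illustrative thresholds)."""

    def __init__(self, max_steps: int, window: int = 6,
                 repeat_limit: int = 3, low_budget: int = 3):
        self.max_steps = max_steps
        self.repeat_limit = repeat_limit
        self.low_budget = low_budget
        self.recent = deque(maxlen=window)  # sliding window of recent actions
        self.steps_used = 0

    def observe(self, action: str) -> Optional[str]:
        """Record one action; return a corrective prompt, or None if none is needed."""
        self.steps_used += 1
        self.recent.append(action)
        remaining = self.max_steps - self.steps_used
        repeats = Counter(self.recent)[action]
        if repeats >= self.repeat_limit:
            return (f"You have issued '{action}' {repeats} times in the last "
                    f"{len(self.recent)} steps. Try a different action.")
        if remaining <= self.low_budget:
            return f"Only {remaining} steps remain. Commit to a final answer soon."
        return None


regulator = TrajectoryRegulator(max_steps=10)
prompts = [regulator.observe(a) for a in ["search[mug]", "search[mug]", "search[mug]"]]
```

In this run the first two calls return `None` and the third returns a prompt asking for a different action, since the same action has filled the repeat limit.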

Experimental evaluation

The framework was tested on three benchmark suites (τ‑bench, τ²‑bench, AgentBench) covering seven deterministic environments (Airline, Retail, Telecom, ALFWorld, WebShop, OS, DBBench) and 18 diverse backbones, including instruction‑tuned, reasoning, and agent‑specialized models.

Performance improved in 116 of the 126 model‑environment configurations (18 models × 7 environments), with an average relative gain of 88.5%.
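The arithmetic behind these counts can be checked in a few lines. The `relative_gain` helper assumes the usual definition of relative improvement, (new − baseline) / baseline; the article does not state exactly how the paper averages gains across configurations, and the scores passed to it below are made up.

```python
models = 18
environments = 7
configurations = models * environments   # every model paired with every environment
improved = 116

share_improved = improved / configurations  # about 0.921, i.e. roughly 92.1%


def relative_gain(baseline: float, with_harness: float) -> float:
    """Relative improvement over the baseline score, as a fraction."""
    if baseline <= 0:
        raise ValueError("baseline must be positive for a relative gain")
    return (with_harness - baseline) / baseline


# Made-up scores: a success rate rising from 0.20 to 0.30 is a 50% relative gain.
example_gain = relative_gain(0.20, 0.30)
```

A relative measure rewards improvements on weak baselines heavily, so a large average relative gain does not by itself imply a large absolute gain.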

The harness was evolved solely from the training trace of Qwen3‑4B‑Instruct and successfully transferred to the other 17 models.

Even agent‑specialized models that had undergone tool‑use training benefited, indicating that interface, action, and trajectory failures persist beyond model‑level training.

These results show that the learned patterns capture stable, environment‑side structures rather than model‑specific quirks.

Relation to model training

Model training, instruction fine‑tuning, reinforcement learning, and distillation remain important for improving model parameters. LIFE‑HARNESS complements these approaches by modifying the runtime interface instead of the model weights.
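To make the distinction concrete, the sketch below treats the model as an opaque function and changes only what goes in and what comes out. The names `with_harness` and `toy_model`, and the trivial repair step, are assumptions for illustration rather than the framework's actual interface.

```python
from typing import Callable

Model = Callable[[str], str]   # opaque: prompt in, action text out


def with_harness(model: Model, contract_text: str,
                 repair: Callable[[str], str]) -> Model:
    """Wrap a fixed model; only its input and output are touched."""
    def wrapped(observation: str) -> str:
        prompt = f"{contract_text}\n\n{observation}"   # input-side adaptation
        return repair(model(prompt))                   # output-side adaptation
    return wrapped


def toy_model(prompt: str) -> str:
    # Stand-in for any backbone; a real one would call an LLM here.
    return "  SEARCH[red mug]  "


agent = with_harness(toy_model, "Use lowercase action names.",
                     lambda action: action.strip().lower())
result = agent("You are on the shop home page.")
```

Because the wrapper never inspects the model's internals, the same wrapper can be placed around a different backbone, which is the property that cross-model transfer relies on.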

Significance

Retraining large models incurs high cost, slow iteration, and tight coupling to specific architectures. Many agent failures arise from mismatches at the model‑environment interface rather than from insufficient model capability. Adapting observation, tool use, action formatting, feedback handling, and trajectory control offers a cost‑effective alternative.

Conclusion

LIFE‑HARNESS demonstrates that LLM agents in deterministic environments can be substantially improved without updating model weights. Experiments across seven environments and 18 models show that the approach is effective and that a harness built from one model's traces transfers well to others.

Paper: https://arxiv.org/abs/2605.22166

Code: https://github.com/Tianshi-Xu/Life-Harness

Source: PaperWeekly
