Vertical Domain Agents Gain 88.5% Boost by Adapting the Runtime Interface, Not Retraining
The paper argues that many failures of deterministic LLM agents stem from mismatches at the model‑environment interface. It introduces LIFE‑HARNESS, a four‑layer runtime harness that mines reusable failure patterns from training trajectories without updating model weights, and reports an average 88.5% relative performance gain across 126 model‑environment settings.
Motivation
Improving LLM agents often relies on larger models, instruction fine‑tuning, reinforcement learning, or distillation. In many deterministic vertical tasks, failures arise not from insufficient model capability but from mismatches at the model‑environment interface.
Observed failure modes include:
The model knows the goal but does not invoke the correct tool.
The intent is correct, but the emitted JSON, function name, parameter types, or SQL syntax cannot be parsed by the environment.
The environment returns an error and the model fails to trigger an effective recovery.
Repeated searches, clicks, or other ineffective actions exhaust the step budget.
These observations motivate treating an agent as a runtime system that continuously observes the environment, understands tools, submits actions, receives feedback, and decides across multiple turns. Performance therefore depends on both model parameters and the runtime interface.
LIFE‑HARNESS Architecture
LIFE‑HARNESS modifies neither the model weights nor the evaluation environment. It mines reusable failure patterns from training trajectories and converts them into runtime interventions. The design consists of four layers:
1. Environment Contract Layer
Before interaction begins, this layer makes tool rules, action protocols, answer formats, environment constraints, and common pitfalls explicit, so the model knows up front what the environment expects of it.
“What does this environment require me to do?”
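As a rough illustration of the idea, an environment contract can be modeled as structured data that is rendered into a preamble for the system prompt. This is a minimal sketch, not the paper's implementation: the `EnvironmentContract` class, its fields, and the example tool `execute_sql` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EnvironmentContract:
    """Hypothetical container for what an environment expects from the agent."""
    tools: Dict[str, str]      # tool name -> one-line usage rule
    action_format: str         # how each action must be emitted
    answer_format: str         # how the final answer must be emitted
    constraints: List[str] = field(default_factory=list)
    pitfalls: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the contract as a plain-text preamble for the system prompt."""
        lines = ["ENVIRONMENT CONTRACT", "Tools:"]
        lines += [f"- {name}: {rule}" for name, rule in self.tools.items()]
        lines.append(f"Action format: {self.action_format}")
        lines.append(f"Answer format: {self.answer_format}")
        if self.constraints:
            lines.append("Constraints:")
            lines += [f"- {c}" for c in self.constraints]
        if self.pitfalls:
            lines.append("Common pitfalls:")
            lines += [f"- {p}" for p in self.pitfalls]
        return "\n".join(lines)


# Illustrative contract for an assumed database environment.
contract = EnvironmentContract(
    tools={"execute_sql": "runs exactly one SQL statement per call"},
    action_format='JSON object with "name" and "arguments"',
    answer_format="Final Answer: <value>",
    constraints=["at most 15 steps per task"],
    pitfalls=["do not wrap SQL in markdown fences"],
)
preamble = contract.render()
print(preamble)
```

Keeping the contract as data rather than free text makes it easier to regenerate the preamble whenever new pitfalls are mined from trajectories.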
2. Procedural Skill Layer
Many vertical tasks follow stable procedural patterns (e.g., shopping agent: search → filter → compare → purchase; database task: locate table/field → construct query → check result). These patterns are extracted from successful training trajectories and supplied to the model as reusable task guides.
When encountering a similar task, refer to the operation path from past successful trajectories.
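One simple way to picture this layer is a lookup that matches an incoming task to a stored procedure and renders it as a guide. The sketch below assumes keyword overlap as the matching rule, which is a stand-in chosen for brevity; the source does not say how skills are retrieved. The skill steps come from the examples above, while `SKILLS`, `KEYWORDS`, `select_skill`, and `render_guide` are hypothetical names.

```python
import re
from typing import Optional

# Procedures taken from the examples in the text; in practice these would be
# extracted from successful training trajectories.
SKILLS = {
    "shopping": ["search", "filter", "compare", "purchase"],
    "database": ["locate table/field", "construct query", "check result"],
}

# Assumed trigger words for each skill (illustrative only).
KEYWORDS = {
    "shopping": {"buy", "purchase", "product", "price"},
    "database": {"table", "query", "sql", "rows"},
}


def select_skill(task: str) -> Optional[str]:
    """Pick the skill whose keywords overlap most with the task text."""
    words = set(re.findall(r"[a-z]+", task.lower()))
    best = max(KEYWORDS, key=lambda name: len(words & KEYWORDS[name]))
    return best if words & KEYWORDS[best] else None


def render_guide(task: str) -> str:
    """Return a one-line procedural hint, or an empty string if nothing matches."""
    name = select_skill(task)
    if name is None:
        return ""
    return f"Suggested procedure ({name}): " + " -> ".join(SKILLS[name])


print(render_guide("Find the cheapest product and buy it"))
```

Returning an empty string for unmatched tasks keeps the guide optional, so an unrelated task is not steered down the wrong procedure.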
3. Action Realization Layer
Even if the model’s intent is correct, the action may be unparsable (e.g., emitting natural language instead of a tool call, missing JSON fields, wrong parameter types, misspelled function names). This layer validates actions before execution and corrects format issues.
The model’s intended action must be transformable into an executable environment command.
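The validation and repair step described above might look like the following sketch. It assumes tool calls are JSON objects with `name` and `arguments` keys and that each tool has a flat schema of parameter types; both are assumptions for illustration, as are the tool names in `TOOL_SCHEMAS` and the function `realize_action`. Misspelled names are repaired with the standard library's `difflib.get_close_matches`.

```python
import difflib
import json
import re

# Hypothetical tool schemas: tool name -> {parameter name: expected type}.
TOOL_SCHEMAS = {
    "search_products": {"query": str, "max_results": int},
    "get_order": {"order_id": str},
}


def realize_action(raw: str):
    """Return (action, error); exactly one of the two is None."""
    # Pull the JSON object out of any surrounding natural language.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return None, "No JSON object found; emit a tool call, not prose."
    try:
        call = json.loads(match.group(0))
    except json.JSONDecodeError as exc:
        return None, f"Invalid JSON: {exc.msg}"

    # Repair a misspelled function name if one known tool is a close match.
    name = str(call.get("name", ""))
    if name not in TOOL_SCHEMAS:
        close = difflib.get_close_matches(name, list(TOOL_SCHEMAS), n=1, cutoff=0.8)
        if not close:
            return None, f"Unknown tool '{name}'. Available: {sorted(TOOL_SCHEMAS)}"
        name = close[0]

    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return None, "'arguments' must be a JSON object."

    # Check required fields and coerce wrong parameter types where possible.
    fixed = {}
    for param, expected in TOOL_SCHEMAS[name].items():
        if param not in args:
            return None, f"Missing field '{param}' for tool '{name}'."
        value = args[param]
        if not isinstance(value, expected):
            try:
                value = expected(value)
            except (TypeError, ValueError):
                return None, f"Field '{param}' must be {expected.__name__}."
        fixed[param] = value
    return {"name": name, "arguments": fixed}, None


raw = 'I will search. {"name": "search_product", "arguments": {"query": "red shoes", "max_results": "5"}}'
action, error = realize_action(raw)
print(action, error)
```

When repair is not possible, the error string can be fed back to the model as a corrective message instead of forwarding an unparsable action to the environment.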
4. Trajectory Regulation Layer
Failures can also manifest across the whole trajectory: repeated searches, clicks, or ineffective actions, and continued attempts after receiving error feedback. This layer monitors repetitions, stagnation, and ineffective recovery signals, and triggers corrective prompts when the trajectory falls into a loop.
When the trajectory starts looping, pull the model back onto an effective path.
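A loop monitor of this kind can be sketched with a sliding window over recent actions plus a counter of consecutive errors. The class name `TrajectoryRegulator`, the thresholds, and the wording of the corrective prompts are assumptions made for illustration; the source does not specify how repetition or stagnation is measured.

```python
from collections import Counter, deque
from typing import Optional


class TrajectoryRegulator:
    """Flags repeated actions and repeated failures within a recent window."""

    def __init__(self, window: int = 6, max_repeats: int = 3, max_errors: int = 2):
        self.recent = deque(maxlen=window)   # last `window` actions
        self.max_repeats = max_repeats
        self.max_errors = max_errors
        self.consecutive_errors = 0

    def observe(self, action: str, is_error: bool) -> Optional[str]:
        """Record one step; return a corrective prompt if the agent seems stuck."""
        self.recent.append(action)
        self.consecutive_errors = self.consecutive_errors + 1 if is_error else 0

        repeats = Counter(self.recent)[action]
        if repeats >= self.max_repeats:
            return (f"You have issued '{action}' {repeats} times recently "
                    "without progress. Try a different action.")
        if self.consecutive_errors >= self.max_errors:
            return (f"The last {self.consecutive_errors} actions returned errors. "
                    "Re-read the error message and change your approach.")
        return None


regulator = TrajectoryRegulator()
hints = [regulator.observe("search shoes", is_error=False) for _ in range(3)]
print(hints[-1])
```

The bounded `deque` means an action repeated long ago does not count against the agent, so only recent stagnation triggers a prompt.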
Experimental Evaluation
Evaluation used three benchmark suites—τ‑bench, τ²‑bench, and AgentBench—covering seven deterministic agent environments. Eighteen model backbones were tested, including instruction‑tuned, reasoning, and agent‑specialized models.
Results:
Improvement was observed in 116 of the 126 model‑environment configurations (18 models × 7 environments).
The average relative gain was 88.5%.
The harness was derived solely from training trajectories of Qwen3‑4B‑Instruct yet transferred successfully to the other 17 models, indicating that it captures environment‑side structural knowledge (tool protocols, task flows, error patterns, recovery strategies) rather than model‑specific quirks.
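To make the headline metric concrete, the sketch below computes a relative gain per setting and averages it. It assumes the usual definition, (score with harness − baseline) / baseline, averaged over settings; the source does not state the exact aggregation. The score pairs are made-up illustrative numbers, not results from the paper.

```python
from typing import List, Tuple


def relative_gain(baseline: float, with_harness: float) -> float:
    """Relative improvement over the baseline score (assumed definition)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive for a relative gain")
    return (with_harness - baseline) / baseline


def average_relative_gain(pairs: List[Tuple[float, float]]) -> float:
    """Mean of per-setting relative gains."""
    gains = [relative_gain(b, h) for b, h in pairs]
    return sum(gains) / len(gains)


# Illustrative (baseline, with-harness) success rates; NOT data from the paper.
scores = [(0.20, 0.40), (0.50, 0.60), (0.40, 0.38)]

avg = average_relative_gain(scores)
improved = sum(1 for b, h in scores if h > b)
print(f"average relative gain: {avg:.1%}, improved settings: {improved}/{len(scores)}")
```

Under this definition a weak baseline that doubles contributes a gain of 100%, so an average relative gain can be large even when absolute score differences are modest.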
Relationship to Model Training
Model training remains essential for improving raw model ability. However, in deterministic vertical agents many failures stem from interface mismatches rather than lack of capability. Even agent‑specialized models that have undergone tool‑use training benefit from LIFE‑HARNESS, showing that tool‑use training alone does not eliminate interface, action, or trajectory failures.
Model training enhances model ability; a runtime harness improves the conditions under which the model interacts with the environment.
Conclusion
The study demonstrates that deterministic LLM agents can be substantially improved without updating model weights by adapting the runtime harness. When an agent underperforms, the first diagnostic step should be to assess whether the issue lies in the model‑environment interface rather than the model itself.
Article: https://arxiv.org/pdf/2605.22166
Title: Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents
Institution: Peking University
Repository: https://github.com/Tianshi-Xu/Life-Harness