How Harness Engineering Lifted LangChain Agents into the Top 5 on Terminal Bench 2.0
LangChain’s Harness Engineering framework tuned system prompts, tool selection, and middleware to lift a programming agent from outside the top 30 to a top‑5 finish on Terminal Bench 2.0, using trace‑driven analysis, inference‑sandwich scheduling, and context engineering, all without changing the underlying model.
Goal of Harness Engineering
The model’s intelligence is "spiky"—excellent on some tasks and disastrous on others. Harness Engineering aims to tame this volatility by providing a suite of system tools—system prompts, tool sets, and middleware—that can be tuned toward specific objectives such as task‑completion rate, token efficiency, or latency.
Choosing the Right Tuning Direction
LangChain’s answer: look at the trace. Trace data reveal where the agent actually fails, which in turn tells you which harness knobs to adjust.
Experiment Setup
Benchmark: Terminal Bench 2.0, a mainstream programming‑agent benchmark covering 89 tasks spanning machine learning, debugging, bioinformatics, and more. The model was fixed to gpt-5.2-codex for the entire experiment.
Three harness knobs were explored:
System Prompt
Tools
Middleware (hooks around model and tool calls)
Baseline (default configuration) scored 52.8% and ranked outside the top 30.
Trace Analyzer: Making Improvements Reproducible
LangChain packaged trace analysis as a reusable Agent Skill with the following workflow:
Pull experiment trace data from LangSmith.
Launch multiple error‑analysis agents in parallel; a master agent aggregates findings and improvement suggestions.
Apply targeted modifications to the harness based on the aggregated feedback.
This process resembles boosting: each round focuses on the errors of the previous round, iteratively strengthening weak spots.
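As a rough illustration, here is a minimal Python sketch of the trace‑pull step using the langsmith SDK. The project name and the analyze_failure helper are hypothetical stand‑ins; the real Agent Skill fans the analysis out to parallel sub‑agents and has a master agent write the improvement suggestions.

```python
# Minimal sketch of the trace-pull step, assuming a hypothetical LangSmith
# project named "terminal-bench-2" and LANGSMITH_API_KEY in the environment.
from langsmith import Client

client = Client()

def analyze_failure(run) -> str:
    """Hypothetical stand-in for a parallel error-analysis sub-agent."""
    return f"{run.name}: {str(run.error)[:200]}"

# Step 1: pull the failed root runs from the experiment project.
failed_runs = client.list_runs(
    project_name="terminal-bench-2",
    is_root=True,
    error=True,
)

# Steps 2-3: analyze each failure, then aggregate the findings into a
# report a master agent would turn into concrete harness changes.
findings = [analyze_failure(run) for run in failed_runs]
print("\n".join(findings))
```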
Building a Self‑Verification Loop
The authors define a three‑stage loop—Plan, Build, Verify—each with a distinct inference budget:
Planning & Exploration: the deepest reasoning, to fully understand the problem before writing code.
Implementation: a lighter inference tier, to keep execution fast and token‑efficient.
Verification: deep reasoning again, to catch errors and ensure quality before finishing.
To enforce the verification stage, they introduced PreCompletionChecklistMiddleware (nicknamed the "Ralph Wiggum Loop"), which intercepts the agent before it exits and forces a final check.
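The post does not show the middleware's code, but the pattern is easy to sketch in plain Python. This is an illustrative shape, not the actual Deep Agents middleware API: the first time the agent signals it is done, it gets a checklist prompt instead of an exit.

```python
# Sketch of the pre-completion check pattern (hypothetical shape, not the
# real Deep Agents API): convert the agent's first "done" attempt into one
# forced verification turn; the second attempt is allowed through.
from dataclasses import dataclass

CHECKLIST = (
    "Did you run the test suite?",
    "Did you re-read the task description against your output?",
    "Did you verify edge cases your code should handle?",
)

@dataclass
class PreCompletionChecklist:
    verified: bool = False

    def intercept(self, agent_message: str) -> str | None:
        """Return a checklist prompt on the first completion attempt;
        return None once verification has been forced."""
        if "DONE" in agent_message and not self.verified:
            self.verified = True
            return "Before finishing, confirm each item:\n- " + "\n- ".join(CHECKLIST)
        return None  # let the agent exit
```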
Context Engineering
LocalContextMiddleware scans the current and parent directories at startup, discovers installed tools (e.g., Python), and injects this information into the agent, reducing context‑related mistakes.
Additional prompts tell the agent that its code will be automatically tested, mirroring a CI pipeline and encouraging coverage of edge cases.
A time‑budget reminder middleware warns the agent when the allotted time is nearly exhausted, prompting a switch to the verification phase.
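A minimal sketch of both ideas, assuming nothing about the real middleware beyond what is described above (the function names, probed tool list, and 80% warning point are illustrative):

```python
# Sketch of the startup context scan and the time-budget reminder.
import os
import shutil
import time

def build_local_context(tools=("python3", "pip", "git", "make")) -> str:
    """Render a context block describing the environment the agent is in."""
    entries = sorted(os.listdir("."))[:40]  # cap to keep the prompt small
    found = [t for t in tools if shutil.which(t)]
    return (
        f"Working directory: {os.getcwd()}\n"
        f"Top-level entries: {', '.join(entries)}\n"
        f"Available tools: {', '.join(found)}\n"
        "Your code will be automatically tested; cover edge cases."
    )

def time_budget_reminder(start: float, budget_s: float) -> str | None:
    """Warn the agent once ~80% of the time budget is spent."""
    if time.monotonic() - start > 0.8 * budget_s:
        return "Time budget nearly exhausted: switch to verification now."
    return None
```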
Loop Detection
LoopDetectionMiddleware tracks how many times each file is edited. If edits exceed a threshold, the agent receives a hint to reconsider its approach, preventing endless "fix‑the‑same‑bug" cycles.
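The edit‑counting logic fits in a few lines; this sketch is an illustrative shape rather than the real middleware, and the threshold of 5 is an arbitrary choice, not a value from the post.

```python
# Sketch of per-file edit counting in the spirit of LoopDetectionMiddleware.
from collections import Counter

class LoopDetector:
    def __init__(self, threshold: int = 5):
        self.threshold = threshold
        self.edits = Counter()

    def record_edit(self, path: str) -> str | None:
        """Count an edit; return a nudge once a file crosses the threshold."""
        self.edits[path] += 1
        if self.edits[path] == self.threshold:
            return (
                f"You have edited {path} {self.threshold} times. "
                "Step back and reconsider your approach before editing again."
            )
        return None
```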
Allocating Inference Compute
gpt-5.2-codex offers four inference modes: low, medium, high, and xhigh. Experiments showed:
Running entirely in xhigh yielded a low score (53.9%) due to frequent timeouts.
Running entirely in high achieved 63.6%.
The final "inference sandwich", xhigh‑high‑xhigh (xhigh reasoning for planning and verification, high for implementation), pushed the score to 66.5% and moved the agent into the top 5.
The authors explain the logic: planning needs deep reasoning, implementation benefits from moderate compute, and verification demands high fidelity.
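In practice the sandwich reduces to a phase‑to‑effort table consulted on every model call. The sketch below assumes an OpenAI Responses‑style per‑request effort knob, which may not match the harness's actual plumbing; xhigh availability depends on the model.

```python
# Sketch of phase-based effort allocation, mirroring the xhigh-high-xhigh
# sandwich. Assumes OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

PHASE_EFFORT = {
    "plan": "xhigh",      # deep reasoning to understand the task
    "implement": "high",  # lighter effort keeps execution fast
    "verify": "xhigh",    # high fidelity for the final check
}

def call_model(messages, phase: str):
    # The reasoning/effort parameter is an assumption about the plumbing;
    # the post only states which effort level each phase received.
    return client.responses.create(
        model="gpt-5.2-codex",
        input=messages,
        reasoning={"effort": PHASE_EFFORT[phase]},
    )
```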
Future Direction
Adaptive reasoning, letting the model decide per‑step compute allocation, is the next evolution and is already being explored in models like Claude and Gemini.
Practical Takeaways: Five Reusable Harness Principles
Provide Robust Context Engineering: inject directory structure, available tools, and best‑practice guidelines before the agent starts.
Enable Self‑Verification: explicitly prompt the agent to run tests and refine solutions, especially in fully autonomous coding pipelines.
Treat Trace as Feedback: use trace data to diagnose missing tools or missing operational guidance.
Detect and Fix Bad Patterns Early: design guardrails (e.g., time budgets, loop detection) to counter current model weaknesses.
Customize Harness per Model: different models (Codex vs. Claude) require tailored prompts; running the same harness on Claude Opus 4.6 gave 59.6% versus higher scores on Codex.
Open Resources
Trace dataset (public): https://smith.langchain.com/public/29393299-8f31-48bb-a949-5a1f5968a744/d?tab=2
Deep Agents source code:
Python: https://github.com/langchain-ai/deepagents
JavaScript: https://github.com/langchain-ai/deepagentsjs
Extended Reading: Harness Engineering Series
0 – Introductory guide: https://qborfy.com/ailearn/harness/00.html
1 – Six core components: https://qborfy.com/ailearn/harness/01.html
2 – Build & Verify mode details: https://qborfy.com/ailearn/harness/02.html
3 – Context engineering in practice: https://qborfy.com/ailearn/harness/03.html
4 – Multi‑agent architecture: https://qborfy.com/ailearn/harness/04.html
5 – Inference sandwich & compute allocation: https://qborfy.com/ailearn/harness/05.html
6 – Future and evolution of Harness: https://qborfy.com/ailearn/harness/06.html