How Harness Engineering Lifted LangChain Agents into the Top 5 on Terminal Bench 2.0
LangChain’s Harness Engineering framework tuned system prompts, tool selection, and middleware to lift a programming agent from outside the top 30 to a top‑5 finish on Terminal Bench 2.0, using trace‑driven analysis, inference‑sandwich scheduling, and context engineering, all without changing the underlying model.
Goal of Harness Engineering
The model’s intelligence is "spiky"—excellent on some tasks and disastrous on others. Harness Engineering aims to tame this volatility by providing a suite of system tools—system prompts, tool sets, and middleware—that can be tuned toward specific objectives such as task‑completion rate, token efficiency, or latency.
Choosing the Right Tuning Direction
LangChain’s answer: look at the trace. Trace data reveal where the agent actually fails, which in turn tells you which harness knobs to adjust.
Experiment Setup
Benchmark: Terminal Bench 2.0, a mainstream programming‑agent benchmark covering 89 tasks spanning machine learning, debugging, bioinformatics, and more. The model was fixed to gpt-5.2-codex for the entire experiment.
Three harness knobs were explored:
System Prompt
Tools
Middleware (hooks around model and tool calls)
Baseline (default configuration) scored 52.8% and ranked outside the top 30.
Trace Analyzer: Making Improvements Reproducible
LangChain packaged trace analysis as a reusable Agent Skill with the following workflow:
Pull experiment trace data from LangSmith.
Launch multiple error‑analysis agents in parallel; a master agent aggregates findings and improvement suggestions.
Apply targeted modifications to the harness based on the aggregated feedback.
This process resembles boosting: each round focuses on the errors of the previous round, iteratively strengthening weak spots.
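As a rough illustration, here is a minimal Python sketch of the trace‑pull step using the langsmith SDK. The project name and the analyze_failure helper are hypothetical stand‑ins; the real Agent Skill fans the analysis out to parallel sub‑agents and has a master agent write the improvement suggestions.

```python
# Minimal sketch of the trace-pull step, assuming a hypothetical LangSmith
# project named "terminal-bench-2" and LANGSMITH_API_KEY in the environment.
from langsmith import Client

client = Client()

def analyze_failure(run) -> str:
    """Hypothetical stand-in for a parallel error-analysis sub-agent."""
    return f"{run.name}: {str(run.error)[:200]}"

# Step 1: pull the failed root runs from the experiment project.
failed_runs = client.list_runs(
    project_name="terminal-bench-2",
    is_root=True,
    error=True,
)

# Steps 2-3: analyze each failure, then aggregate the findings into a
# report a master agent would turn into concrete harness changes.
findings = [analyze_failure(run) for run in failed_runs]
print("\n".join(findings))
```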
Building a Self‑Verification Loop
The authors define a three‑stage loop—Plan, Build, Verify—each with a distinct inference budget:
Planning & Exploration: the deepest reasoning, to fully understand the problem before writing code.
Implementation: a lighter inference tier, to keep execution fast and token‑efficient.
Verification: deep reasoning again, to catch errors and ensure quality before finishing.
To enforce the verification stage, they introduced PreCompletionChecklistMiddleware (nicknamed the "Ralph Wiggum Loop"), which intercepts the agent before it exits and forces a final check.
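The post does not show the middleware's code, but the pattern is easy to sketch in plain Python. This is an illustrative shape, not the actual Deep Agents middleware API: the first time the agent signals it is done, it gets a checklist prompt instead of an exit.

```python
# Sketch of the pre-completion check pattern (hypothetical shape, not the
# real Deep Agents API): convert the agent's first "done" attempt into one
# forced verification turn; the second attempt is allowed through.
from dataclasses import dataclass

CHECKLIST = (
    "Did you run the test suite?",
    "Did you re-read the task description against your output?",
    "Did you verify edge cases your code should handle?",
)

@dataclass
class PreCompletionChecklist:
    verified: bool = False

    def intercept(self, agent_message: str) -> str | None:
        """Return a checklist prompt on the first completion attempt;
        return None once verification has been forced."""
        if "DONE" in agent_message and not self.verified:
            self.verified = True
            return "Before finishing, confirm each item:\n- " + "\n- ".join(CHECKLIST)
        return None  # let the agent exit
```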
Context Engineering
LocalContextMiddleware scans the current and parent directories at startup, discovers installed tools (e.g., Python), and injects this information into the agent, reducing context‑related mistakes.
Additional prompts tell the agent that its code will be automatically tested, mirroring a CI pipeline and encouraging coverage of edge cases.
A time‑budget reminder middleware warns the agent when the allotted time is nearly exhausted, prompting a switch to the verification phase.
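A minimal sketch of both ideas, assuming nothing about the real middleware beyond what is described above (the function names, probed tool list, and 80% warning point are illustrative):

```python
# Sketch of the startup context scan and the time-budget reminder.
import os
import shutil
import time

def build_local_context(tools=("python3", "pip", "git", "make")) -> str:
    """Render a context block describing the environment the agent is in."""
    entries = sorted(os.listdir("."))[:40]  # cap to keep the prompt small
    found = [t for t in tools if shutil.which(t)]
    return (
        f"Working directory: {os.getcwd()}\n"
        f"Top-level entries: {', '.join(entries)}\n"
        f"Available tools: {', '.join(found)}\n"
        "Your code will be automatically tested; cover edge cases."
    )

def time_budget_reminder(start: float, budget_s: float) -> str | None:
    """Warn the agent once ~80% of the time budget is spent."""
    if time.monotonic() - start > 0.8 * budget_s:
        return "Time budget nearly exhausted: switch to verification now."
    return None
```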
Loop Detection
LoopDetectionMiddleware tracks how many times each file is edited. If edits exceed a threshold, the agent receives a hint to reconsider its approach, preventing endless "fix‑the‑same‑bug" cycles.
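The edit‑counting logic fits in a few lines; this sketch is an illustrative shape rather than the real middleware, and the threshold of 5 is an arbitrary choice, not a value from the post.

```python
# Sketch of per-file edit counting in the spirit of LoopDetectionMiddleware.
from collections import Counter

class LoopDetector:
    def __init__(self, threshold: int = 5):
        self.threshold = threshold
        self.edits = Counter()

    def record_edit(self, path: str) -> str | None:
        """Count an edit; return a nudge once a file crosses the threshold."""
        self.edits[path] += 1
        if self.edits[path] == self.threshold:
            return (
                f"You have edited {path} {self.threshold} times. "
                "Step back and reconsider your approach before editing again."
            )
        return None
```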
Allocating Inference Compute
gpt-5.2-codex offers four inference modes: low, medium, high, and xhigh. Experiments showed:
Running entirely in xhigh yielded a low score (53.9%) due to frequent timeouts.
Running entirely in high achieved 63.6%.
The final "inference sandwich", xhigh‑high‑xhigh (xhigh reasoning for planning and verification, high for implementation), pushed the score to 66.5% and moved the agent into the top 5.
The authors explain the logic: planning needs deep reasoning, implementation benefits from moderate compute, and verification demands high fidelity.
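In practice the sandwich reduces to a phase‑to‑effort table consulted on every model call. The sketch below assumes an OpenAI Responses‑style per‑request effort knob, which may not match the harness's actual plumbing; xhigh availability depends on the model.

```python
# Sketch of phase-based effort allocation, mirroring the xhigh-high-xhigh
# sandwich. Assumes OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

PHASE_EFFORT = {
    "plan": "xhigh",      # deep reasoning to understand the task
    "implement": "high",  # lighter effort keeps execution fast
    "verify": "xhigh",    # high fidelity for the final check
}

def call_model(messages, phase: str):
    # The reasoning/effort parameter is an assumption about the plumbing;
    # the post only states which effort level each phase received.
    return client.responses.create(
        model="gpt-5.2-codex",
        input=messages,
        reasoning={"effort": PHASE_EFFORT[phase]},
    )
```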
Future Direction
Adaptive reasoning, letting the model decide per‑step compute allocation, is the next evolution and is already being explored in models like Claude and Gemini.
Practical Takeaways: Five Reusable Harness Principles
Provide Robust Context Engineering: inject directory structure, available tools, and best‑practice guidelines before the agent starts.
Enable Self‑Verification: explicitly prompt the agent to run tests and refine solutions, especially in fully autonomous coding pipelines.
Treat Trace as Feedback: use trace data to diagnose missing tools or missing operational guidance.
Detect and Fix Bad Patterns Early: design guardrails (e.g., time budgets, loop detection) to counter current model weaknesses.
Customize Harness per Model: different models (Codex vs. Claude) require tailored prompts; running the same harness on Claude Opus 4.6 gave 59.6% versus higher scores on Codex.
Open Resources
Trace dataset (public): https://smith.langchain.com/public/29393299-8f31-48bb-a949-5a1f5968a744/d?tab=2
Deep Agents source code:
Python: https://github.com/langchain-ai/deepagents
JavaScript: https://github.com/langchain-ai/deepagentsjs
Extended Reading: Harness Engineering Series
0 – Introductory guide: https://qborfy.com/ailearn/harness/00.html
1 – Six core components: https://qborfy.com/ailearn/harness/01.html
2 – Build & Verify mode details: https://qborfy.com/ailearn/harness/02.html
3 – Context engineering in practice: https://qborfy.com/ailearn/harness/03.html
4 – Multi‑agent architecture: https://qborfy.com/ailearn/harness/04.html
5 – Inference sandwich & compute allocation: https://qborfy.com/ailearn/harness/05.html
6 – Future and evolution of Harness: https://qborfy.com/ailearn/harness/06.html