Google & Microsoft Harnesses: Core LLM Post‑Training Methods and 2025‑2026 Trends

Two recent papers—Microsoft's M⋆, which evolves task‑specific memory harnesses, and Google's AutoHarness, which automatically synthesizes code‑level constraints—demonstrate reflective code evolution and tree‑search program synthesis. Both achieve state‑of‑the‑art performance across diverse benchmarks and outline LLM post‑training directions for 2025‑2026.

Introduction

As LLM agents advance rapidly, designing appropriate harnesses (constraints or "bridles") becomes a critical challenge. This article reviews two recent papers—Microsoft's M⋆, which focuses on task‑specific memory harnesses, and Google's AutoHarness, which targets automatic code‑level constraints—each proposing an automated method for evolving harnesses.

1. M⋆: Task‑Specific Memory Harnesses

1.1 Core Problem – Limitations of Fixed Memory Structures

Current LLM agents rely on fixed, hand‑designed memory structures: semantic retrieval for dialogue agents, skill systems for code agents, structured databases for specialized domains. A design optimized for one domain often fails to transfer to others.

Figure 1 illustrates that different tasks (Legal, Conversation, Embodied AI, Healthcare) require distinct memory harness structures such as entity‑relation graphs, relational databases, or trajectory lookup tables.

1.2 Method – Executable Program Evolution

M⋆ represents a memory harness as a Python memory program composed of three components:

Schema: defines the data formats for storage and retrieval (implemented with Python dataclasses).

Logic: specifies backend read/write operations and can invoke vector stores, SQL, or LLMs.

Instruction: provides prompt constants that guide how the agent interacts with the memory.
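
To make the three components concrete, here is a minimal sketch of what an evolved memory program might look like. This is an illustration under assumed names (DialogueEntry, MemoryLogic, MEMORY_INSTRUCTION), not code from the paper:

```python
from dataclasses import dataclass, field

# Schema: data format for stored memory entries (illustrative).
@dataclass
class DialogueEntry:
    speaker: str
    text: str
    timestamp: float
    tags: list[str] = field(default_factory=list)

# Logic: backend read/write operations. A real evolved program might
# call a vector store, a SQL database, or an LLM summarizer here.
class MemoryLogic:
    def __init__(self) -> None:
        self.entries: list[DialogueEntry] = []

    def write(self, entry: DialogueEntry) -> None:
        self.entries.append(entry)

    def read(self, query: str, k: int = 5) -> list[DialogueEntry]:
        # Naive keyword match standing in for semantic retrieval.
        hits = [e for e in self.entries if query.lower() in e.text.lower()]
        return hits[:k]

# Instruction: prompt constant telling the agent how to use the memory.
MEMORY_INSTRUCTION = (
    "Before answering, call read(query) to retrieve relevant entries; "
    "after answering, call write(entry) to store new facts."
)
```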

The system employs Reflective Code Evolution:

Validation‑loop sampling: evaluate the current program on static and rotated validation sets.

Coding‑agent iteration: from execution traces and failure cases, an LLM analyzes root causes and generates code patches.

Constraint check & auto‑repair: compile checks, smoke tests, and runtime constraints (e.g., response length ≤ 3000 characters).

On top of this loop, a population‑based search balances exploration and exploitation: softmax sampling with a temperature parameter selects high‑scoring programs for mutation, as sketched below.
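
A minimal sketch of that population loop, assuming each candidate program carries a scalar validation score; `evaluate` and `mutate` are hypothetical stand‑ins for the validation loop and the LLM coding agent described above:

```python
import math
import random

def softmax_sample(programs, scores, temperature=0.5):
    """Sample one program, favoring high validation scores."""
    m = max(scores)  # subtract max for numerical stability
    weights = [math.exp((s - m) / temperature) for s in scores]
    return random.choices(programs, weights=weights, k=1)[0]

def evolve(population, evaluate, mutate, iterations=20):
    """Population-based reflective evolution loop (sketch)."""
    scored = [(p, evaluate(p)) for p in population]
    for _ in range(iterations):
        programs = [p for p, _ in scored]
        scores = [s for _, s in scored]
        parent = softmax_sample(programs, scores)  # exploit strong programs
        child = mutate(parent)                     # explore via code patches
        scored.append((child, evaluate(child)))
    return max(scored, key=lambda ps: ps[1])[0]    # best program found
```

A lower temperature concentrates sampling on the current best programs; a higher temperature keeps weaker candidates in play, which is how the search trades exploitation against exploration.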

1.3 Experimental Results

Across four heterogeneous benchmarks—LoCoMo (dialogue), ALFWorld (embodied), HealthBench (medical), and PRBench (legal/finance)—M⋆ achieved the best performance in 7 out of 8 configurations. Table 1 (partial) shows M⋆ surpassing fixed‑memory baselines on most tasks.

Figure 3 visualizes the evolution trajectory: early iterations fix structural errors, mid‑stage iterations yield large gains, and late‑stage iterations make fine‑grained refinements.

Key Findings

Structural diversity: different tasks evolve markedly different memory structures (e.g., ALFWorld uses a simple list plus LLM summaries, while LoCoMo combines SQL with ChromaDB).

Task specificity: cross‑task transfer experiments reveal that a memory program evolved for task A often underperforms a generic baseline on task B, confirming the need for task‑aligned harness design.

2. AutoHarness: Automatic Code‑Level Constraints

2.1 Core Problem – Illegal Actions in LLM Agents

Even though LLMs excel at code generation and mathematical reasoning, they frequently propose illegal actions in tightly defined environments (e.g., 78% of Gemini‑2.5‑Flash failures in a Kaggle GameArena chess competition stem from illegal moves). Traditional solutions require hand‑written constraint code for each game, which is labor‑intensive and error‑prone.

2.2 Method – Tree Search + Thompson Sampling for Code Synthesis

AutoHarness frames harness generation as a program search problem. It uses Thompson‑sampling‑guided tree search to balance exploration (trying diverse logic structures) and exploitation (refining promising harnesses).
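
As a sketch of the search mechanic (not the paper's implementation), each candidate harness in the tree can keep a Beta posterior over its success rate; Thompson sampling draws one sample from every posterior and expands the node whose draw is highest:

```python
import random

class HarnessNode:
    """One candidate harness program in the search tree."""
    def __init__(self, code: str):
        self.code = code
        self.successes = 1  # Beta(1, 1) uniform prior
        self.failures = 1

    def sample_success_rate(self) -> float:
        # Thompson sampling: draw from the Beta posterior over legality.
        return random.betavariate(self.successes, self.failures)

def select_node(tree):
    # Uncertain nodes sometimes win the draw (exploration); consistently
    # strong nodes usually win (exploitation).
    return max(tree, key=lambda n: n.sample_success_rate())

def record_outcome(node, action_was_legal: bool) -> None:
    if action_was_legal:
        node.successes += 1
    else:
        node.failures += 1
```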

Three harness modes are supported:

harness‑as‑action‑filter: generate a set of legal action candidates, then let the LLM rank and select among them.

harness‑as‑action‑verifier (primary experimental setting): the LLM proposes an action, code verifies its legality, and illegal actions trigger a retry (sketched after this list).

harness‑as‑policy: implement the entire policy in Python code, eliminating LLM calls at test time.
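
A minimal sketch of the verifier loop follows; `llm_propose` and `is_legal` are illustrative names for the LLM call and the synthesized legality check, not the paper's API:

```python
def act_with_verifier(state, llm_propose, is_legal, max_retries=3):
    """Harness-as-action-verifier loop (sketch): the LLM proposes an
    action, synthesized code checks legality, and an illegal proposal
    triggers a retry with the rejection reason fed back to the model."""
    feedback = None
    for _ in range(max_retries):
        action = llm_propose(state, feedback)
        if is_legal(state, action):
            return action
        feedback = f"Action {action!r} is illegal in this state; try again."
    raise RuntimeError("No legal action proposed within the retry budget.")
```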

The iterative loop is feedback‑driven: the environment returns legality signals and rewards; from error cases and trajectories, the LLM produces code patches (V4A format); and a compile‑repair cycle automatically fixes syntax errors and runtime constraint violations.
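
The compile‑repair step can be sketched as follows, with `llm_fix` as a hypothetical stand‑in for the LLM repair call that consumes the error message:

```python
def compile_repair(source: str, llm_fix, max_attempts=3) -> str:
    """Syntax-check a patched harness; on failure, ask the LLM to repair
    it using the error message (sketch; llm_fix is hypothetical)."""
    for _ in range(max_attempts):
        try:
            compile(source, "<harness>", "exec")  # syntax check only
            return source
        except SyntaxError as err:
            source = llm_fix(source, str(err))  # feed the error back
    raise RuntimeError("Harness failed to compile after repair attempts.")
```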

2.3 Experimental Results

AutoHarness was evaluated on 145 TextArena games (excluding free‑text dialogue games). Training converged quickly: an average of 14.5 tree‑search iterations sufficed to reach a 100% legal‑action rate, with 19 of 32 games converging within 10 iterations.

Two‑player (2P) games: Gemini‑2.5‑Flash equipped with a harness won 9 of 16 matchups (a 56.3% overall win rate) against Gemini‑2.5‑Pro (38.2%), demonstrating that a smaller model with a dedicated harness can outperform a larger model.

Single‑player (1P) games: an average reward of 0.745, surpassing Gemini‑2.5‑Pro (0.707) and GPT‑5.2 (0.635). In the extreme "harness‑as‑policy" mode, generating full strategy code yielded an average reward of 0.870, beating GPT‑5.2‑High (0.844) while incurring near‑zero test‑time cost (no LLM calls).

Conclusion

Reviewing these two papers reveals a common trend: research on LLM agents is shifting from “making models smarter” to “providing agents with more suitable harness frameworks.” The presented methods and results outline promising directions for LLM post‑training research through 2025‑2026.

Tags: LLM, Agent, Memory, Tree Search, Harness, AutoHarness, Reflective Evolution
Written by

Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
