Surpassing Claude Mythos and GPT‑5.5: Stanford’s New LLM‑as‑a‑Verifier Agent Framework

Stanford, UC Berkeley, and Nvidia introduce LLM-as-a-Verifier, a verification framework that scales verification compute: fine-grained score tokens, repeated checks, and criteria decomposition boost agent performance, eliminate scoring ties, and achieve SOTA results on Terminal-Bench, surpassing Claude Mythos and GPT-5.5 while improving safety in long-horizon tasks.

Machine Heart

Method Overview

Stanford, UC Berkeley, and Nvidia jointly propose the agent verification framework LLM-as-a-Verifier, a generic verification mechanism that can be combined with any Agent Harness and model. By scaling verification compute, the method significantly improves overall agent performance and surpasses GPT-5.5 and Claude Mythos on the AI-coding benchmark Terminal-Bench.

Core Problem of LLM-as-a-Judge

Most Agent Harnesses can solve a problem if run many times (e.g., 100 runs), but they cannot identify which run yields the correct answer, a difficulty that is especially acute for long‑horizon tasks. The standard LLM‑as‑a‑Judge scores each trajectory with a coarse discrete rating (e.g., 1–8). This coarse granularity often assigns the same score to different trajectories, leading to ties—27% of cases on Terminal‑Bench—thereby limiting discriminative power.

LLM-as-a-Verifier Design

The verifier improves on the judge by scaling along three dimensions:

Repeated verification (multiple evaluations of the same trajectory).

Granularity of score tokens (finer‑grained scoring).

Decomposition of evaluation criteria.

Increasing score‑token granularity widens the score gap between positive and negative samples, enhancing discriminability.
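For intuition, here is a toy illustration (with made-up numbers) of why finer score tokens break ties: two trajectories that a coarse 1-8 judge would both grade the same can still be separated by the expected value of a finer-grained score-token distribution.

```python
# Toy numbers: probabilities a verifier assigns to nearby levels on a
# 20-level scale, for two trajectories a coarse 1-8 judge would both call "6".
traj_a = {14: 0.2, 15: 0.6, 16: 0.2}   # expected score 15.0
traj_b = {15: 0.3, 16: 0.5, 17: 0.2}   # expected score 15.9

def expected_score(dist):
    """Continuous score: expectation over the score-token distribution."""
    return sum(level * p for level, p in dist.items())

print(expected_score(traj_a), expected_score(traj_b))  # 15.0 vs 15.9: no tie
```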

Formal Scoring Procedure

Given a task t and two candidate trajectories τ_A and τ_B, the verifier constructs a scoring prompt and extracts the top log-probs of the <score_A> and <score_B> tokens to obtain conditional score distributions. The reward for a trajectory τ is computed as:

R(\tau) = \frac{1}{C \cdot K} \sum_{c=1}^{C} \sum_{k=1}^{K} \sum_{g=1}^{G} p_{c,k}(s_g \mid t, \tau)\, \phi(s_g)

where C is the number of evaluation criteria, K the number of repeated verifications, G the number of score-token levels, p_{c,k}(s_g | t, τ) the probability assigned to the g-th score token under criterion c in repetition k, and φ the mapping from score tokens to scalar values. To select the best trajectory, a round-robin tournament compares every pair (i, j); the trajectory with the most wins is chosen.
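A minimal sketch of this procedure follows. The `score_pair` helper is a placeholder for the actual verifier call (here it returns random distributions); the criteria names, G = 20, and K = 4 are illustrative assumptions, not the paper's exact settings.

```python
import random  # stands in for an actual LLM call in this sketch

G = 20                      # score-token levels (granularity), assumed
K = 4                       # repeated verifications per criterion, assumed
CRITERIA = ["specification compliance", "output format", "error checking"]

def score_pair(task, traj_a, traj_b, criterion):
    """Hypothetical verifier call: returns two dicts mapping each score
    level 1..G to its probability, read off the top log-probs of the
    <score_A> and <score_B> tokens. Faked with random numbers here."""
    def fake_dist():
        weights = [random.random() for _ in range(G)]
        total = sum(weights)
        return {level + 1: w / total for level, w in enumerate(weights)}
    return fake_dist(), fake_dist()

def expected_score(dist):
    """Map a score-token distribution to a continuous value:
    the expectation of the level over the G score tokens."""
    return sum(level * p for level, p in dist.items())

def pairwise_rewards(task, traj_a, traj_b):
    """Average expected scores over C criteria and K repetitions."""
    r_a = r_b = 0.0
    for criterion in CRITERIA:
        for _ in range(K):
            dist_a, dist_b = score_pair(task, traj_a, traj_b, criterion)
            r_a += expected_score(dist_a)
            r_b += expected_score(dist_b)
    n = len(CRITERIA) * K
    return r_a / n, r_b / n

def round_robin(task, trajectories):
    """Compare every pair (i, j); the trajectory with the most wins is selected."""
    wins = [0] * len(trajectories)
    for i in range(len(trajectories)):
        for j in range(i + 1, len(trajectories)):
            r_i, r_j = pairwise_rewards(task, trajectories[i], trajectories[j])
            wins[i if r_i >= r_j else j] += 1
    return trajectories[wins.index(max(wins))]
```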

Experimental Results

1. On the complex long‑horizon benchmarks Terminal‑Bench 2.0 and SWE‑Bench Verified, LLM-as-a‑Verifier achieves current SOTA performance, with all results drawn from official leaderboards.

2. The verifier integrates seamlessly with different Agent Harness frameworks, improving accuracy to:

ForgeCode: 86.4%

Terminus‑Kira: 79.4%

Terminus 2: 71.2%

3. Compared with the traditional LLM-as-a‑Judge, the verifier maintains at least a 7% verification‑accuracy advantage even when the number of repetitions is increased to k = 16, and it completely eliminates tie situations.

4. Ablation studies show that both higher score-token granularity and more repeated verifications markedly raise verification accuracy. Refining granularity from 1 to 20 score levels dramatically reduces quantization error, bringing the measured reward closer to the true reward (a rough numeric sketch of this effect follows).
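The sketch below illustrates the quantization-error intuition under a simplifying assumption (uniformly random "true" qualities in [0, 1], used purely for illustration): rounding a continuous quality to one of G evenly spaced levels introduces an error that shrinks roughly as 1/G.

```python
import random

def mean_quantization_error(num_levels, samples=10_000):
    """Average gap between a 'true' quality in [0, 1] and the nearest
    of num_levels evenly spaced score levels."""
    levels = [(i + 0.5) / num_levels for i in range(num_levels)]
    total = 0.0
    for _ in range(samples):
        q = random.random()
        total += min(abs(q - level) for level in levels)
    return total / samples

for g in (1, 2, 8, 20):
    print(g, round(mean_quantization_error(g), 4))  # error drops as G grows
```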

Decomposed Evaluation Criteria

Specification compliance – whether the trajectory satisfies all task requirements.

Output format – whether the generated output matches the expected format.

Error checking – detection of obvious error signals in the trajectory.
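A sketch of how these three criteria might be folded into a pairwise verification prompt is shown below; the wording, tag names, and 20-level scale are illustrative assumptions, not the paper's exact template.

```python
SCORE_LEVELS = 20  # assumed score-token granularity

def build_verifier_prompt(task, traj_a, traj_b, criterion):
    """Build one pairwise verification prompt for a single criterion.
    The model is asked to emit <score_A> and <score_B>; their top
    log-probs are later read back as score distributions."""
    return (
        f"Task:\n{task}\n\n"
        f"Trajectory A:\n{traj_a}\n\n"
        f"Trajectory B:\n{traj_b}\n\n"
        f"Evaluate ONLY the following criterion: {criterion}.\n"
        f"Rate each trajectory on an integer scale of 1 to {SCORE_LEVELS}.\n"
        "Answer in the form: <score_A>N</score_A> <score_B>M</score_B>"
    )

criteria = ["specification compliance", "output format", "error checking"]
prompts = [build_verifier_prompt("fix the failing test", "...", "...", c)
           for c in criteria]
```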

Verification Compute as a New Scaling Dimension

LLM-as-a‑Verifier is a universal verification mechanism that substantially boosts overall agent performance and attains SOTA results on multiple AI‑coding benchmarks, surpassing other leading models such as Claude Mythos. By employing finer scoring granularity, repeated verification, and criteria decomposition, it achieves higher verification accuracy, eliminates scoring ties, and enhances safety and stability in long‑horizon tasks.

Tags: LLM, SWE-Bench, Terminal-Bench, Long-Horizon Tasks, Agent Verification, LLM-as-a-Verifier