Stanford’s LLM-as-a-Verifier Beats Claude Mythos and GPT‑5.5 on Agent Benchmarks

Stanford, Berkeley and Nvidia researchers introduce LLM-as-a-Verifier, a universal verification framework that enhances agent performance, safety and stability on long‑horizon tasks, and outperforms Claude Mythos and GPT‑5.5 on the Terminal‑Bench and SWE‑Bench benchmarks.

Data Party THU

The project, led by Stanford PhD student Jacky Kwok with contributors from Berkeley EECS and Nvidia AI, presents LLM-as-a-Verifier, a general-purpose verification mechanism that can be combined with any agent harness and model.

Most agents can eventually produce a correct answer if run many times, but they cannot tell which run is the correct one. The traditional LLM-as-a-Judge approach scores each trajectory with a coarse discrete rating (e.g., 1‑8), which often produces ties (27% of cases on Terminal‑Bench), so the best trajectory cannot be singled out.

LLM-as-a-Verifier addresses this limitation by scaling verification along three dimensions: (1) repeated verifications, (2) finer granularity of score tokens, and (3) decomposition of evaluation criteria. Given a task t and two candidate trajectories, the verifier constructs a scoring prompt, extracts the top log-probabilities at the <score_A> and <score_B> positions, and derives from them a conditional distribution over rewards. The resulting rewards decide each pairwise match in a round‑robin tournament, and the trajectory with the most wins is selected as the final answer.
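
The selection step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `top_logprobs` dictionary (score token mapped to log-probability), the `expected_score` helper, and the `compare` callback that stands in for the LLM call are all hypothetical names introduced here.

```python
import math
from itertools import combinations
from typing import Callable, Hashable, Sequence


def expected_score(top_logprobs: dict[str, float]) -> float:
    """Expected score under the distribution renormalized over numeric tokens.

    `top_logprobs` is an assumed format: token string -> log-probability,
    as read from the model's top-logprobs at a score position.
    """
    scores, weights = [], []
    for token, logprob in top_logprobs.items():
        token = token.strip()
        if token.isdecimal():  # ignore non-numeric tokens
            scores.append(int(token))
            weights.append(math.exp(logprob))
    total = sum(weights)
    if total == 0:
        raise ValueError("no numeric score tokens in top_logprobs")
    return sum(s * w for s, w in zip(scores, weights)) / total


def round_robin(
    candidates: Sequence[Hashable],
    compare: Callable[[Hashable, Hashable], tuple[float, float]],
) -> tuple[Hashable, dict[Hashable, int]]:
    """Play every pair once; the candidate with the most wins is selected.

    `compare(a, b)` is a placeholder for the verifier call and returns
    the pair of rewards (reward_a, reward_b).
    """
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        reward_a, reward_b = compare(a, b)
        if reward_a > reward_b:
            wins[a] += 1
        elif reward_b > reward_a:
            wins[b] += 1
    winner = max(candidates, key=lambda c: wins[c])
    return winner, wins
```

Because the reward is a probability-weighted average rather than a single sampled integer, two trajectories rarely receive exactly the same value, which is one plausible reading of why ties become much less likely than with a discrete rating.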

Scoring formula
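
One way to write a reward that is consistent with the three dimensions described above is given below. It is a sketch under assumptions, not necessarily the authors' exact formula: C is the number of criteria, K the number of repeated verifications, G the number of score tokens, v_g the g-th score token, \phi a mapping from score tokens to numeric values, and p_\theta^{(k)} the verifier's token distribution on the k-th repetition. All of these symbols are introduced here for illustration.

```latex
\[
R(t, \tau) \;=\; \frac{1}{C K} \sum_{c=1}^{C} \sum_{k=1}^{K} \sum_{g=1}^{G}
  p_\theta^{(k)}\!\left(v_g \mid t, c, \tau\right)\, \phi(v_g)
\]
```

Under this reading, the inner sum is the expected score for one criterion on one repetition, and the outer sums average over repetitions and criteria.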

Experimental results show that LLM-as-a-Verifier achieves state‑of‑the‑art performance on complex long‑horizon benchmarks such as Terminal‑Bench 2.0 and SWE‑Bench Verified, surpassing frontier models including Claude Mythos and GPT‑5.5. It integrates with a range of agent harnesses, raising verification accuracy to 86.4% on ForgeCode, 79.4% on Terminus‑Kira, and 71.2% on Terminus 2. The verifier outperforms LLM-as-a-Judge by at least 7% in verification accuracy, even with 16 repeated checks, and eliminates ties entirely.

Further analysis reveals that increasing the granularity of score tokens and the number of repeated verifications both significantly boost verification accuracy. Refining token granularity from 1 to 20 dramatically reduces quantization error, bringing the estimated reward closer to the true reward.
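
The effect of granularity on quantization error can be illustrated with a toy model. The sketch below assumes a true reward in [0, 1] that must be reported as the midpoint of one of `granularity` equal-width bins; this nearest-bin model and the function names are assumptions made for illustration, not the procedure used in the paper.

```python
def quantize(reward: float, granularity: int) -> float:
    """Snap a reward in [0, 1] to the midpoint of one of `granularity` equal bins."""
    if not 0.0 <= reward <= 1.0:
        raise ValueError("reward must lie in [0, 1]")
    if granularity < 1:
        raise ValueError("granularity must be at least 1")
    # The top edge (reward == 1.0) belongs to the last bin.
    index = min(int(reward * granularity), granularity - 1)
    return (index + 0.5) / granularity


def mean_abs_error(granularity: int, samples: int = 1000) -> float:
    """Average |true - quantized| over an evenly spaced grid of true rewards."""
    rewards = [(i + 0.5) / samples for i in range(samples)]
    return sum(abs(r - quantize(r, granularity)) for r in rewards) / samples
```

In this toy model the mean error is about 0.25 at granularity 1 and about 0.0125 at granularity 20, a twentyfold reduction, since the error scales as 1/(4 × granularity).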

The framework decomposes evaluation into three composable criteria: Specification (whether the trajectory meets all task requirements), Output Format (whether the output matches the expected format), and Error Checking (whether obvious errors are present). By providing finer‑grained feedback, LLM-as-a-Verifier not only improves agent performance but also enhances safety and stability on long‑horizon tasks.
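
The three criteria can be combined into a single reward in several ways. The sketch below assumes the simplest option, an unweighted mean over criteria of the mean over repeated verifications; the criterion keys, the function name, and the uniform weighting are all assumptions for illustration rather than details confirmed by the source.

```python
from statistics import fmean

# Hypothetical keys for the three criteria named in the article.
CRITERIA = ("specification", "output_format", "error_checking")


def aggregate_reward(scores: dict[str, list[float]]) -> float:
    """Average repeated scores per criterion, then average across criteria.

    `scores` maps each criterion to the rewards from its repeated
    verifications; every criterion must have at least one score.
    """
    missing = [c for c in CRITERIA if not scores.get(c)]
    if missing:
        raise ValueError(f"missing scores for: {', '.join(missing)}")
    per_criterion = [fmean(scores[c]) for c in CRITERIA]
    return fmean(per_criterion)
```

Keeping the per-criterion means separate before combining them also makes it easy to report which criterion a trajectory failed, which is the kind of finer-grained feedback the paragraph above describes.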
