Stanford’s LLM-as-a-Verifier Beats Claude Mythos and GPT‑5.5 on Agent Benchmarks
Stanford, Berkeley and Nvidia researchers introduce LLM-as-a-Verifier, a universal verification framework that enhances agent performance, safety and stability on long‑horizon tasks, and outperforms Claude Mythos and GPT‑5.5 on the Terminal‑Bench and SWE‑Bench benchmarks.
The project is led by Stanford PhD student Jacky Kwok, with contributors from Berkeley EECS and Nvidia AI. It presents LLM-as-a-Verifier as a generic verification mechanism that can be combined with any agent harness and any model.
Given enough runs, most agents eventually produce a correct answer, but they cannot tell which run is the correct one. The traditional LLM-as-a-Judge approach scores each trajectory with a coarse discrete rating (e.g., 1‑8), which often produces ties (27% of cases on Terminal‑Bench) and makes it impossible to single out the best trajectory.
LLM-as-a-Verifier addresses this limitation by scaling verification along three dimensions: (1) repeated verifications, (2) finer granularity of score tokens, and (3) decomposition of evaluation criteria. Given a task t and two candidate trajectories, the verifier constructs a scoring prompt, extracts the top log-probabilities at the <score_A> and <score_B> positions, and derives a conditional distribution over rewards. The resulting reward for each trajectory decides pairwise comparisons in a round‑robin tournament, and the trajectory with the most wins is selected as the final answer.
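The scoring and selection steps can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the reward is the expected score under the top log-probabilities renormalized over numeric score tokens, and the names `expected_score`, `round_robin`, and `compare` are hypothetical. In practice `compare` would call a model and read the log-probabilities at the two score positions.

```python
import math
from itertools import combinations
from typing import Callable, Hashable, Sequence


def expected_score(top_logprobs: dict[str, float]) -> float:
    """Expected score over the numeric tokens in a top-logprobs dict.

    Assumption: non-numeric tokens are dropped and the remaining
    probability mass is renormalized before taking the expectation.
    """
    mass: dict[int, float] = {}
    for token, logprob in top_logprobs.items():
        token = token.strip()
        if token.isdigit():
            mass[int(token)] = mass.get(int(token), 0.0) + math.exp(logprob)
    total = sum(mass.values())
    if total == 0.0:
        raise ValueError("no numeric score tokens in top_logprobs")
    return sum(score * p for score, p in mass.items()) / total


def round_robin(
    candidates: Sequence[Hashable],
    compare: Callable[[Hashable, Hashable], tuple[float, float]],
) -> tuple[Hashable, dict[Hashable, int]]:
    """Compare every pair once and return the candidate with the most wins.

    `compare(a, b)` returns (reward_a, reward_b). If two candidates end
    with the same number of wins, the earlier one in `candidates` is kept.
    """
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        reward_a, reward_b = compare(a, b)
        if reward_a > reward_b:
            wins[a] += 1
        elif reward_b > reward_a:
            wins[b] += 1
    best = max(candidates, key=lambda c: wins[c])
    return best, wins
```

Because the reward is a continuous expectation rather than a single sampled integer, two trajectories rarely receive exactly the same value, which is one plausible reading of why ties disappear.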
Experimental results show that LLM-as-a-Verifier achieves state‑of‑the‑art performance on complex long‑horizon benchmarks such as Terminal‑Bench 2.0 and SWE‑Bench Verified, surpassing frontier models including Claude Mythos and GPT‑5.5. It integrates with various agent harnesses, raising verification accuracy to 86.4% on ForgeCode, 79.4% on Terminus‑Kira, and 71.2% on Terminus 2. The verifier consistently outperforms LLM-as-a-Judge by at least 7% in verification accuracy even with 16 repeated checks, and it eliminates ties entirely.
Further analysis reveals that increasing the granularity of score tokens and the number of repeated verifications both significantly boost verification accuracy. Refining token granularity from 1 to 20 dramatically reduces quantization error, bringing the estimated reward closer to the true reward.
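The effect of granularity on quantization error can be illustrated with a toy model. This sketch assumes the true reward lies in [0, 1] and that a scale with `levels` score tokens maps it to the midpoint of one of `levels` equal-width bins; both the binning scheme and the function names are assumptions made for illustration, not details from the source.

```python
def quantize(reward: float, levels: int) -> float:
    """Map a reward in [0, 1] to the midpoint of its bin on a `levels`-point scale."""
    if not 0.0 <= reward <= 1.0:
        raise ValueError("reward must lie in [0, 1]")
    if levels < 1:
        raise ValueError("levels must be at least 1")
    bin_index = min(int(reward * levels), levels - 1)  # keep reward == 1.0 in the top bin
    return (bin_index + 0.5) / levels


def max_quantization_error(levels: int, samples: int = 1001) -> float:
    """Largest gap between a reward and its quantized value on an evenly spaced grid."""
    grid = [i / (samples - 1) for i in range(samples)]
    return max(abs(quantize(r, levels) - r) for r in grid)
```

Under these assumptions the worst-case error is 1 / (2 * levels), so it falls from 0.5 with a single level to 0.025 with 20. The same toy model shows how coarse scales create ties: rewards of 0.62 and 0.66 land in the same bin on a 5-level scale but in different bins on a 20-level scale.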
The framework decomposes evaluation into three composable criteria: Specification (whether the trajectory meets all task requirements), Output Format (whether the output matches the expected format), and Error Checking (whether obvious errors are present). By providing finer‑grained feedback, LLM-as-a-Verifier not only improves agent performance but also enhances safety and stability on long‑horizon tasks.
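One way to combine the three criteria with repeated verifications is sketched below. The source does not state how per-criterion rewards are aggregated, so the unweighted mean used here is purely an assumption, as are the criterion keys and the `aggregate_reward` name.

```python
from statistics import mean

# Hypothetical keys for the three criteria named in the article.
CRITERIA = ("specification", "output_format", "error_checking")


def aggregate_reward(scores: dict[str, list[float]]) -> float:
    """Combine repeated per-criterion rewards into one reward for a trajectory.

    `scores[criterion]` holds one reward per repeated verification.
    Assumption: average the repeats within each criterion, then take the
    unweighted mean across criteria.
    """
    missing = [c for c in CRITERIA if not scores.get(c)]
    if missing:
        raise ValueError(f"missing scores for: {', '.join(missing)}")
    return mean(mean(scores[c]) for c in CRITERIA)
```

Keeping the criteria separate until the final step means each one can be inspected on its own, for example to see that a trajectory met the specification but failed the output-format check.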
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
