Machine Heart
Machine Heart
Apr 26, 2026 · Artificial Intelligence

Surpassing Claude Mythos and GPT‑5.5: Stanford’s New LLM‑as‑a‑Verifier Agent Framework

Stanford, Berkeley and Nvidia introduce LLM‑as‑a‑Verifier, a verification framework that scales verification compute, uses fine‑grained score tokens, repeated checks and criteria decomposition to boost agent performance, eliminate scoring ties and achieve SOTA results on Terminal‑Bench, surpassing Claude Mythos and GPT‑5.5 while improving safety in long‑horizon tasks.

Agent VerificationLLMLLM-as-a-Verifier
0 likes · 8 min read
Surpassing Claude Mythos and GPT‑5.5: Stanford’s New LLM‑as‑a‑Verifier Agent Framework
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
Mar 12, 2026 · Artificial Intelligence

LongHorizonUI: A Unified Robust Framework for Long‑Horizon GUI Agent Automation

LongHorizonUI tackles the steep success‑rate drop of GUI agents on tasks longer than 10‑15 steps by introducing three tightly coupled modules—enhanced perception, deep reflective decision, and compensatory execution—and validates the approach on the new LongGUIBench benchmark with consistent performance gains across both app and game scenarios.

GUI automationICLR 2026Long-Horizon Tasks
0 likes · 12 min read
LongHorizonUI: A Unified Robust Framework for Long‑Horizon GUI Agent Automation