Tagged articles

Terminal-Bench

10 articles

Jul 25, 2026 · Artificial Intelligence

Why Strong LLMs Fail on Real Terminals and Claude Code + Fable 5 Leads Terminal‑Bench 2.1

Terminal‑Bench 2.1 shows Claude Code + Fable 5 topping the leaderboard with an 83.8% success rate, revealing that raw model strength alone does not guarantee terminal performance and that the engineering of the Agent Harness is the decisive factor.

AI code generation · Agent Harness · Claude Code

7 min read

IT Services Circle

Jul 9, 2026 · Artificial Intelligence

Musk Unveils Grok 4.5: First Benchmarks Reveal Its Strengths and Limits

SpaceXAI's Grok 4.5 arrives ahead of OpenAI's GPT‑5.6 Sol, scoring 83.3% on Terminal‑Bench and 78.0% on multilingual code fixing, with a striking $2/$6 per‑million‑token price and 80 TPS inference speed, yet it still lags behind top rivals on the hardest SWE‑Bench Pro tasks.

AI coding agent · Grok 4.5 · SWE-bench

6 min read

Machine Learning Algorithms & Natural Language Processing

Jun 28, 2026 · Artificial Intelligence

OpenAI Unveils GPT‑5.6 ‘Solar System’: Sol, Terra, Luna Redefine Model Capabilities

OpenAI unveiled three GPT‑5.6 models—Sol, Terra and Luna—featuring tiered pricing, record‑breaking benchmark scores in programming, security and biology, new max and ultra inference modes, and a limited early rollout, while also noting several unexpected failure cases.

Anthropic · Cerebras · ExploitBench

9 min read

Architect's Tech Stack

Jun 8, 2026 · Artificial Intelligence

Claude 4.8 vs Codex 5.5: Which Code‑Generation Model Performs Better?

The author compares Claude 4.8 (Opus) and Codex 5.5 on SWE‑bench Pro (69.2% vs 58.6%) and Terminal‑Bench (78.2% vs 74.6%). Claude offers a larger 1 M‑token context and higher accuracy on complex multi‑file tasks at a higher cost, while Codex is faster and cheaper for terminal‑focused work; the article recommends each for specific scenarios.

AI code generation · Claude 4.8 · Codex 5.5

4 min read

Data Party THU

May 26, 2026 · Artificial Intelligence

Stanford’s LLM-as-a-Verifier Beats Claude Mythos and GPT‑5.5 on Agent Benchmarks

Stanford, Berkeley and Nvidia researchers introduce LLM-as-a-Verifier, a universal verification framework that enhances agent performance, safety and stability on long‑horizon tasks, and outperforms Claude Mythos and GPT‑5.5 on the Terminal‑Bench and SWE‑Bench benchmarks.

AI Agents · Agent verification · LLM-as-a-Verifier

7 min read

Machine Heart

May 20, 2026 · Artificial Intelligence

Self‑Evolving Harness Engineering Propels GPT‑5.4 to a 7‑Point Gain, Securing a Global Top‑3 Spot

The paper introduces Agentic Harness Engineering (AHE), an observability‑driven framework that automatically evolves coding‑agent harnesses, boosting GPT‑5.4's pass@1 score on Terminal‑Bench 2 from 69.7% to 77.0% (+7.3 points), achieving a worldwide top‑three ranking and demonstrating strong cross‑task and cross‑model generalization.

Agentic Harness Engineering · Cross-Model Generalization · GPT-5.4

14 min read

Old Zhang's AI Learning

Apr 29, 2026 · Artificial Intelligence

Top 10 Open‑Source LLM Benchmarks: Scores, Rankings, and What They Test

This article walks through ten mainstream open‑source large‑model benchmarks—SWE‑bench Verified and Pro, MMLU‑Pro, GPQA Diamond, HLE, AIME, HMMT, olmOCR‑bench, Terminal‑Bench 2.0, and EvasionBench—explaining their data, evaluation metrics, current leading models, and the capability dimensions they reveal.

AI evaluation · LLM benchmarks · MMLU-Pro

20 min read

Machine Heart

Apr 26, 2026 · Artificial Intelligence

Surpassing Claude Mythos and GPT‑5.5: Stanford’s New LLM‑as‑a‑Verifier Agent Framework

Stanford, Berkeley and Nvidia introduce LLM‑as‑a‑Verifier, a verification framework that scales verification compute and uses fine‑grained score tokens, repeated checks and criteria decomposition to boost agent performance and eliminate scoring ties. It achieves SOTA results on Terminal‑Bench, surpassing Claude Mythos and GPT‑5.5 while improving safety in long‑horizon tasks.

Agent verification · LLM · LLM-as-a-Verifier

8 min read

Java Web Project

Apr 25, 2026 · Artificial Intelligence

Why GPT-5.5’s Silent Release Signals Real Engineering Power

OpenAI’s April 23, 2026 launch of GPT-5.5 delivers record‑high scores on SWE‑Bench Pro (58.6%) and Terminal‑Bench 2.0 (82.7%), adds persistent multi‑file context, dynamic reasoning time, and token efficiency, while real‑world case studies show substantial productivity gains across engineering teams.

AI engineering · Codex · GPT-5.5

13 min read

Old Zhang's AI Learning

Feb 22, 2026 · Artificial Intelligence

Why the Minimalist pi Coding Agent Beats Feature‑Heavy Claude Code and OpenCode

In a landscape where Claude Code, Cursor and Windsurf pile on built‑in features, the pi terminal coding agent adopts a minimalist "primitives, not features" philosophy: it removes most defaults, offers extensible CLI tools, supports dozens of LLM providers, and outperforms its rivals on Terminal‑Bench 2.0.

Coding Agent · Extensions · LLM

16 min read
