
STAR‑PólyaMath Beats GPT‑5.5 by 13.5 Points on Apex and Leads Across Eight Major Math Competitions

STAR‑PólyaMath, a multi‑agent reasoning system from T‑STAR Lab and Microsoft Research, introduces an exploration‑reasoning‑verification harness that outperforms GPT‑5.5 on the toughest MathArena Apex 2025 problems by 13.5 percentage points and achieves perfect scores on six other top math competition benchmarks.

Machine Heart

Jun 24, 2026


Motivation

State‑of‑the‑art large language models (LLMs) can generate locally coherent mathematical arguments but often fail to detect dead‑end reasoning paths, accumulate hallucinations, and lack persistent memory of failed attempts.

STAR‑PólyaMath Framework

STAR‑PólyaMath is a multi‑agent system that operates outside the LLM. An orchestrator (a Python script without reasoning capability) cycles through three agents (Reasoner, Verifier, and Meta‑Strategist) to implement an exploration‑reasoning‑verification harness. The design follows Pólya’s four‑step problem‑solving cycle (understand, plan, execute, review) and adds persistent meta‑supervision.
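The control flow described above can be sketched in a few lines. This is a minimal illustration, not the released implementation: the function name `orchestrate`, the verdict strings, and the three toy agents are all assumptions made for the example, and the values 3/4 and 1/2 are borrowed from the article's Apex Problem 2 case study.

```python
from typing import Callable, List, Optional


def orchestrate(
    problem: str,
    reasoner: Callable[[str, List[str]], str],   # (problem, banned) -> candidate
    verifier: Callable[[str, str], str],         # (problem, candidate) -> verdict
    strategist: Callable[[str, str], bool],      # (problem, candidate) -> re-plan?
    max_attempts: int = 5,
) -> Optional[str]:
    """Route messages between agents; the loop itself makes no mathematical judgment."""
    banned: List[str] = []  # directions the Meta-Strategist has ruled out
    for _ in range(max_attempts):
        candidate = reasoner(problem, banned)
        verdict = verifier(problem, candidate)
        if verdict == "accept":
            return candidate
        # Only a re-plan proposal is escalated; other verdicts simply retry here.
        if verdict == "propose_replan" and strategist(problem, candidate):
            banned.append(candidate)
    return None


# Toy stand-ins for the LLM-backed agents (hypothetical, for illustration only).
def toy_reasoner(problem: str, banned: List[str]) -> str:
    return "1/2" if "3/4" in banned else "3/4"


def toy_verifier(problem: str, candidate: str) -> str:
    return "accept" if candidate == "1/2" else "propose_replan"


def toy_strategist(problem: str, candidate: str) -> bool:
    return True  # always confirms the re-plan in this toy run


result = orchestrate("Apex 2025 Problem 2", toy_reasoner, toy_verifier, toy_strategist)
print(result)  # prints 1/2
```

The point of the sketch is that the orchestrator only passes messages and keeps the ban list; every judgment about correctness or direction is delegated to an agent.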

Case Study: MathArena Apex 2025 Problem 2

Problem 2 (“The Zigzagging Chessboard”, Turkey TST 2025 P5) asks for an optimal constant k; the correct answer is k = 1/2. GPT‑5.5 (high‑effort) attempted the problem eight times and succeeded only once, repeatedly converging on a sub‑optimal construction that gave a wrong answer.

STAR‑PólyaMath’s Reasoner initially produced the incorrect answer 3/4. The Verifier challenged the proof; after three timeout failures the Meta‑Strategist judged the direction fundamentally wrong, prohibited further reasoning on 3/4, and issued a re‑plan. The new plan discovered a denser construction, yielding the correct k = 1/2, which was confirmed by both symbolic derivation and a short Python program that generated a simple connected polygon.

Three Core Difficulties in Long‑Range Reasoning

Hallucination accumulation: Small errors in intermediate steps are amplified because models retain high confidence in their own conclusions.

Cross‑attempt memory loss: When a proof path fails, systems either keep too much context (obscuring the error) or discard crucial information, leading to repeated exploration of disproven directions.

Imbalance between reasoning and tool use: Models trained on tool‑use data may over‑rely on code execution, while pure symbolic reasoning struggles with combinatorial constructions; without meta‑cognitive judgment they cannot decide when to compute versus when to reason.

Agent Roles and Interaction

Reasoner explores problem structure, plans, executes each reasoning or computation step, and defends its arguments when challenged. It retains full memory within a single attempt but resets memory on re‑planning to avoid contaminating new attempts.

Verifier independently audits Reasoner’s output without memory, applying two gates: a Goal Gate (ensuring the declared sub‑goal is met) and a Logic Gate (checking logical correctness). It returns one of four verdicts: Accept, Challenge, Trace‑Back, or Propose‑Replan.

Meta‑Strategist provides persistent, high‑level supervision. It maintains a single long‑lived session that records all attempts, discarded strategies, and recurring failure patterns. When Verifier signals a re‑plan, Meta‑Strategist makes the final decision and can enforce mode switches (e.g., prohibiting code use for purely symbolic problems).
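The three memory scopes above (per‑attempt for Reasoner, none for Verifier, persistent for Meta‑Strategist) can be illustrated with two small data structures. This is a sketch under assumptions: the class names `ReasonerContext` and `StrategistLedger`, their fields, and the `replan` method are hypothetical and not taken from the released code.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ReasonerContext:
    """Working memory for a single attempt; discarded on re-plan."""
    steps: List[str] = field(default_factory=list)


@dataclass
class StrategistLedger:
    """Long-lived record that survives across all attempts."""
    attempts: List[List[str]] = field(default_factory=list)
    banned_strategies: Set[str] = field(default_factory=set)

    def replan(self, context: ReasonerContext, failed_strategy: str) -> ReasonerContext:
        self.attempts.append(list(context.steps))    # archive what was tried
        self.banned_strategies.add(failed_strategy)  # remember not to retry it
        return ReasonerContext()                     # fresh, uncontaminated context


ledger = StrategistLedger()
ctx = ReasonerContext(steps=["guess k = 3/4", "build construction"])
ctx = ledger.replan(ctx, "k = 3/4")

print(ctx.steps)                 # prints []
print(ledger.banned_strategies)  # prints {'k = 3/4'}
```

Verifier has no counterpart here on purpose: in this sketch it would be a plain function that receives only the output it is asked to audit.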

Verification Tags and Adaptive Checking

Each intermediate assertion is labeled as [verified] (code‑executed), [easy‑verify] (simple calculation), or [hard‑verify] (rigorous mathematical review). The Verifier’s scrutiny level follows the tag, trusting code‑based steps quickly while subjecting symbolic steps to strict logical analysis.

Statistical analysis shows that on computation‑heavy contests (AIME, HMMT) 36‑43% of assertions are verified by code, whereas on proof‑heavy contests (IMO, Putnam) over 85% require [hard‑verify].
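The tagging scheme and the per‑contest tag shares can be sketched as follows. The `Tag` enum, the `SCRUTINY` descriptions, and the `tag_shares` helper are illustrative assumptions, and the sample tag list is made up for the example rather than drawn from the paper's data.

```python
from collections import Counter
from enum import Enum
from typing import Dict, List


class Tag(Enum):
    VERIFIED = "verified"        # backed by executed code
    EASY_VERIFY = "easy-verify"  # simple calculation
    HARD_VERIFY = "hard-verify"  # needs rigorous mathematical review


# Hypothetical mapping from tag to how hard the Verifier looks at the step.
SCRUTINY: Dict[Tag, str] = {
    Tag.VERIFIED: "trust the executed result",
    Tag.EASY_VERIFY: "recompute the simple calculation",
    Tag.HARD_VERIFY: "strict step-by-step logical review",
}


def tag_shares(tags: List[Tag]) -> Dict[Tag, float]:
    """Fraction of assertions carrying each tag (0.0 for every tag if the list is empty)."""
    counts = Counter(tags)
    total = len(tags)
    return {tag: (counts[tag] / total if total else 0.0) for tag in Tag}


sample = [Tag.VERIFIED, Tag.HARD_VERIFY, Tag.HARD_VERIFY, Tag.EASY_VERIFY]
shares = tag_shares(sample)
print(shares[Tag.HARD_VERIFY])  # prints 0.5
```

Statistics like the contest‑level percentages quoted above are the same kind of share, computed over all tagged assertions in a benchmark.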

Error Recovery Mechanisms

The system employs two layered recovery strategies. Trace‑Back rolls back to the offending step, archives the failed branch, and reuses validated intermediate results. Re‑plan aborts the entire plan when Meta‑Strategist judges the overall direction flawed, archives the whole attempt, and bans the failed strategy for future Reasoner runs.
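The difference between the two recovery layers is easiest to see on a plain list of steps. The helpers `trace_back` and `re_plan` below are hypothetical names, and representing a proof attempt as a list of strings is a simplifying assumption for the sketch.

```python
from typing import List, Set


def trace_back(steps: List[str], bad_index: int, archive: List[List[str]]) -> List[str]:
    """Archive the failed branch and keep the validated steps before it."""
    archive.append(steps[bad_index:])
    return steps[:bad_index]


def re_plan(steps: List[str], strategy: str,
            archive: List[List[str]], banned: Set[str]) -> List[str]:
    """Archive the whole attempt, ban its strategy, and start from scratch."""
    archive.append(list(steps))
    banned.add(strategy)
    return []


archive: List[List[str]] = []
banned: Set[str] = set()
steps = ["s1 ok", "s2 ok", "s3 flawed", "s4 depends on s3"]

kept = trace_back(steps, 2, archive)
print(kept)  # prints ['s1 ok', 's2 ok']

fresh = re_plan(kept, "strategy A", archive, banned)
print(fresh, banned)  # prints [] {'strategy A'}
```

Trace‑Back is the cheaper operation because validated work is reused; Re‑plan throws that work away in exchange for a guarantee that the banned direction is not explored again.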

Experimental Results

Using GPT‑5.5 (high‑effort) as the base model, STAR‑PólyaMath achieved the best scores on all eight top‑tier math competition benchmarks (AIME 2025/2026, Putnam 2025, HMMT 2026, etc.). On the hardest MathArena Apex 2025 benchmark it reached 93.75% accuracy versus 80.21% for raw GPT‑5.5, a gain of 13.54 percentage points (the headline 13.5%).
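The headline figure is the difference between the two reported accuracies, so it is a gain in percentage points rather than a relative improvement. The short calculation below uses only the two numbers given above; the variable names are illustrative.

```python
star_polyamath = 93.75  # Apex 2025 accuracy (%), as reported in the article
raw_gpt = 80.21         # raw GPT-5.5 accuracy (%), as reported in the article

absolute_gain = star_polyamath - raw_gpt       # difference in percentage points
relative_gain = absolute_gain / raw_gpt * 100  # improvement relative to the baseline

print(f"absolute gain: {absolute_gain:.2f} percentage points")  # prints 13.54
print(f"relative gain: {relative_gain:.2f}%")                   # prints 16.88%
```

Measured relative to the baseline, the same result corresponds to an improvement of roughly 16.9%.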

Runtime analysis shows simple problems (AIME level) finish within eight minutes with minimal Meta‑Strategist involvement, while difficult problems (Apex 2025, IMO 2025) require over 55 minutes and 1.6‑2.2 Meta‑Strategist interventions per problem.

Ablation studies:

Removing both Trace‑Back and Re‑plan mechanisms caused the largest drop on IMO 2025 and Apex 2025, indicating that cross‑step error recovery is critical.

Eliminating Meta‑Strategist’s persistent memory degraded results more than removing Meta‑Strategist entirely, showing that a memory‑less supervisor adds noise.

Disabling Reasoner’s ability to defend against Verifier challenges reduced Putnam 2025 accuracy from 91.67% to 75%, highlighting the importance of bidirectional debate.

Generalization Beyond Mathematics

The core ideas (decomposing long‑range tasks into verifiable sub‑steps, structured cross‑attempt memory, and persistent meta‑supervision) apply to any domain that requires traceable, reliable reasoning, such as generate‑test‑debug loops in code generation or hypothesis‑experiment‑review cycles in scientific discovery.

Full code, prompts, and configuration files are publicly released at https://github.com/Julius-Woo/STAR-PolyaMath. The paper is available on arXiv at https://arxiv.org/abs/2605.19338v1.
