FactReview: An AI‑Agent System for Evidence‑Grounded Peer Review of Papers and Code

FactReview reframes peer review as evidence‑grounded claim assessment: it extracts structured claims from the paper, positions them against related literature, and verifies empirical claims through sandboxed code execution, producing a report of five‑level evidence labels. Experiments on CompGCN and across backend LLMs demonstrate its strengths and current limitations.


Motivation and Limitations of Existing LLM Reviewers

Machine‑learning conferences face exploding submission volumes, putting reviewers under severe time pressure (Pineau et al., 2020; Raff, 2019). Recent LLM‑based reviewers such as MARG (D'Arcy et al., 2024), OpenReviewer (Idahl & Ahmadi, 2024), DeepReview (Zhu et al., 2025b) and ReviewerToo (Sahu et al., 2025) generate fluent review texts from the paper alone, but they share three structural weaknesses: (1) excessive sensitivity to rhetorical style, (2) inability to verify empirical claims, and (3) lack of traceable evidence for each judgment.

FactReview’s Core Idea

FactReview treats automated reviewing as a claim‑assessment problem. For each extracted claim the system gathers evidence from three sources—internal paper arguments, retrieved neighboring literature, and execution traces of the accompanying code repository—and assigns one of five evidence‑driven labels (Supported, Supported by the paper, Partially supported, In conflict, Inconclusive). The design principle is that review should be evidence collection, not a final accept/reject decision.
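
To make that unit of work concrete, here is a minimal data-model sketch in Python. The class and field names (Claim, Evidence, Label) are our own illustration, not names from the FactReview implementation.

from dataclasses import dataclass, field
from enum import Enum
from typing import List

class Label(Enum):
    SUPPORTED = "Supported"                        # backed by external evidence (code or literature)
    SUPPORTED_BY_PAPER = "Supported by the paper"  # internal argument only
    PARTIALLY_SUPPORTED = "Partially supported"    # only a subset of sub-claims verified
    IN_CONFLICT = "In conflict"                    # contradictory external evidence
    INCONCLUSIVE = "Inconclusive"                  # insufficient evidence

@dataclass
class Evidence:
    source: str       # "paper", "literature", or "execution"
    detail: str       # e.g. "FB15k-237 MRR 0.352 from run log"

@dataclass
class Claim:
    text: str
    claim_type: str                     # empirical / methodological / theoretical / reproducibility
    scope: dict                         # tasks, datasets, metrics
    evidence: List[Evidence] = field(default_factory=list)
    label: Label = Label.INCONCLUSIVE   # default until evidence is gathered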

Stage 1: Document Parsing and Claim Extraction

The submitted manuscript is parsed into a structured representation that preserves section boundaries, tables, formulas and figure references. Using the DeepReview v2 prompting strategy, FactReview performs schema‑constrained extraction, producing for each claim its type (empirical, methodological, theoretical, reproducibility), scope (tasks, datasets, metrics) and evidence targets. Broad claims are decomposed into atomic sub‑claims; for example, a statement such as “our method outperforms all baselines on multiple tasks” is split into separate claims per task‑dataset‑metric triple, allowing each to be verified independently.
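
A minimal sketch of that decomposition step, assuming the claim's scope has already been extracted; the function and dictionary names are hypothetical, not FactReview's API.

def decompose(claim_text, datasets_by_task, metric_by_task):
    """Expand a broad empirical claim into one atomic sub-claim per task-dataset-metric triple."""
    sub_claims = []
    for task, datasets in datasets_by_task.items():
        for dataset in datasets:
            sub_claims.append({
                "parent": claim_text,
                "task": task,
                "dataset": dataset,
                "metric": metric_by_task[task],
            })
    return sub_claims

subs = decompose(
    "our method outperforms all baselines on multiple tasks",
    datasets_by_task={"link prediction": ["FB15k-237", "WN18RR"],
                      "node classification": ["MUTAG", "AM"]},
    metric_by_task={"link prediction": "MRR", "node classification": "accuracy"},
)
# Each of the four resulting sub-claims can now be verified independently in Stage 3.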

Stage 2: Literature Positioning

From the extracted claims FactReview builds a local comparison set by retrieving cited methods, named baselines and semantically similar papers. The module identifies the neighboring method families, highlights design choices that differentiate the target paper, and answers the question “what technical role does this paper play relative to its neighbors?” This provides concrete context for novelty assessment without producing a scalar novelty score.
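
The retrieval step could look roughly like the sketch below, which combines explicitly cited papers with nearest neighbors under embedding similarity; embed() stands in for any sentence-embedding model and is not FactReview's actual retriever.

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def comparison_set(target_abstract, candidates, cited_ids, embed, k=10):
    """Union of explicitly cited baselines and the k semantically nearest papers."""
    target_vec = embed(target_abstract)
    scored = sorted(candidates, key=lambda p: -cosine(target_vec, embed(p["abstract"])))
    cited = [p for p in candidates if p["id"] in cited_ids]
    seen, result = set(), []
    for p in cited + scored[:k]:          # cited methods first, then nearest neighbors
        if p["id"] not in seen:
            seen.add(p["id"])
            result.append(p)
    return result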

Stage 3: Execution‑Based Claim Verification

When a code repository is available, FactReview runs a stateful verification workflow in five steps (a minimal code sketch follows the list):

Repository parsing and environment construction: the artifact is unpacked, a sandboxed workspace is created, and dependencies are installed.

Task planning: commands, configuration files and entry scripts are inspected to derive an explicit list of verification tasks.

Bounded execution: each task runs under explicit time and resource budgets, with logs, return codes and intermediate outputs recorded.

Bounded repair: only environment‑level fixes (e.g., missing dependencies, path adjustments) are permitted; core model code is never altered, preserving verification conservatism.

Claim‑evidence alignment: execution outputs are matched against the numeric values and tables extracted from the paper. Weak or missing alignment yields the Inconclusive label instead of a forced positive/negative judgment.
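
A compressed sketch of this loop is shown below. The commands, time budget and matching tolerance are illustrative assumptions, not values taken from the paper.

import subprocess

TIME_BUDGET_S = 3600    # per-task wall-clock budget (assumed value)
REL_TOLERANCE = 0.02    # relative tolerance when matching reported numbers (assumed value)

def run_task(cmd, workdir):
    """Run one verification task inside the sandboxed workspace and record its trace."""
    try:
        proc = subprocess.run(cmd, cwd=workdir, capture_output=True,
                              text=True, timeout=TIME_BUDGET_S)
        return {"returncode": proc.returncode, "stdout": proc.stdout, "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        return {"returncode": None, "stdout": "", "stderr": "time budget exceeded"}

def align(reproduced, reported):
    """Compare a reproduced metric against the paper's number."""
    if reproduced is None:
        return "Inconclusive"
    rel_gap = abs(reproduced - reported) / max(abs(reported), 1e-9)
    return "match" if rel_gap <= REL_TOLERANCE else "mismatch"

# Example usage (assuming a sandboxed CompGCN workspace exists):
# trace = run_task(["python", "run.py", "--data", "FB15k-237"], workdir="workspace/compgcn")
# A parsing step would then extract the reproduced MRR from trace["stdout"] before align().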

Five‑Level Claim Label Taxonomy

The label hierarchy distinguishes external verification from internal textual support. “Supported” requires external evidence (code output or literature) that directly backs the claim; “Supported by the paper” relies solely on the paper’s internal argument; “Partially supported” applies when only a subset of decomposed sub‑claims are verified; “In conflict” indicates contradictory external evidence; “Inconclusive” denotes insufficient evidence.
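
One way to operationalize this hierarchy is the decision rule below; the ordering reflects our reading of the taxonomy and of the CompGCN case study, not FactReview's exact logic.

def assign_label(sub_results, has_external_evidence, has_internal_argument):
    """sub_results: one of 'match', 'mismatch', 'unknown' per decomposed sub-claim."""
    n = len(sub_results)
    matches = sum(r == "match" for r in sub_results)
    mismatches = sum(r == "mismatch" for r in sub_results)
    if has_external_evidence and n > 0 and matches == n:
        return "Supported"                 # every sub-claim backed by code output or literature
    if 0 < matches < n:
        return "Partially supported"       # only a subset of sub-claims verified
    if mismatches > 0:
        return "In conflict"               # external evidence contradicts the claim
    if has_internal_argument:
        return "Supported by the paper"    # internal argument only (e.g. a proof)
    return "Inconclusive"                  # insufficient evidence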

Case Study: CompGCN End‑to‑End Evaluation

FactReview was applied to the CompGCN model (Vashishth et al., 2019), which contains three high‑impact claims:

C1 (empirical): “outperforms all baselines on link prediction, node classification and graph classification.”

C2 (theoretical): “the framework subsumes prior multi‑relational GCNs.”

C3 (scalability): “basis decomposition reduces parameter growth while preserving effectiveness.”

During Stage 3 the system executed the CompGCN repository on four tasks and six datasets. Replicated results matched the paper for link prediction (FB15k‑237 MRR 0.352 vs. reported 0.355; WN18RR MRR 0.477 vs. 0.479) and node classification (MUTAG accuracy 84.9% vs. 85.3%; AM accuracy 90.1% vs. 90.6%). However, for graph classification on MUTAG the reproduced accuracy was 88.4% while the paper’s strongest baseline reported 92.6%, leading FactReview to downgrade C1 from “Supported” to “Partially supported.” C2 received “Supported by the paper” because its evidence is purely mathematical, and C3 was labeled “Supported” based on consistent empirical trends.
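
The relative gaps behind these numbers make the matching criterion concrete; the 2% acceptance threshold below is an illustrative assumption, not the paper's.

pairs = {
    "FB15k-237 MRR":      (0.352, 0.355),
    "WN18RR MRR":         (0.477, 0.479),
    "MUTAG accuracy (%)": (84.9, 85.3),
    "AM accuracy (%)":    (90.1, 90.6),
}
for name, (reproduced, reported) in pairs.items():
    gap = abs(reproduced - reported) / reported
    print(f"{name}: {gap:.2%} gap -> {'match' if gap <= 0.02 else 'mismatch'}")
# All four gaps are below 1%, so the corresponding sub-claims of C1 match, whereas the
# graph-classification result (88.4% against a 92.6% baseline) does not and drags C1
# down to Partially supported.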

Comparison with Pure‑Text LLM Reviewers

The paper contrasts FactReview’s structured output (Figure 3) with a standard pure‑text LLM reviewer (Figure 2). The latter accepts the broad empirical claim wholesale, whereas FactReview isolates each claim, aligns it with task‑level evidence, and demotes over‑general statements when verification fails.

Backend LLM Sensitivity Analysis

FactReview’s verification pipeline was run with six different LLM back‑ends while keeping the CompGCN workflow fixed. Success rates (percentage of verification rounds that produced usable evidence) ranged from 83.3 % for Claude Opus 4.6 down to 41.7 % for Claude Haiku 4.5. Average wall‑clock time varied between 24.1 min (Claude Opus) and 28.9 min (GPT‑4.1), and per‑round API cost spanned $0.68 (Claude Opus) to $0.16 (Claude Haiku). The observations are:

Model families exhibit clear scaling trends (Claude: Opus > Sonnet > Haiku; GPT: 5.4 > 4.1 > 4o).

Performance gaps are largest on complex tasks (graph classification, basis‑decomposition analysis), confirming that execution‑based review is not merely a software automation problem.

Low‑cost models dramatically reduce verification reliability, suggesting that investment in a stronger LLM is justified for high‑stakes peer review.
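
Dividing cost by success rate gives a rough cost per usable verification round; this derived figure is ours, computed from the numbers quoted above.

backends = {
    "Claude Opus 4.6":  {"success_rate": 0.833, "cost_per_round": 0.68},
    "Claude Haiku 4.5": {"success_rate": 0.417, "cost_per_round": 0.16},
}
for name, s in backends.items():
    effective = s["cost_per_round"] / s["success_rate"]
    print(f"{name}: ${effective:.2f} per round that yields usable evidence")
# Roughly $0.82 for Opus versus $0.38 for Haiku: the cheaper model stays cheaper even per
# usable round, so the case for a stronger back-end rests on reliability, not price.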

Failure Analysis of Execution Verification

Across 72 verification rounds (6 models × 12 tasks) FactReview classified failures into three categories:

Artifact‑level (8 cases, 29.6 % of failures, 11.1 % of all rounds): missing entry points or ambiguous repository structure.

Execution‑level (14 cases, 51.9 % of failures, 19.4 % of all rounds): dependency drift, unavailable data/checkpoints, resource mismatches.

Interpretation‑level (5 cases, 18.5 % of failures, 6.9 % of all rounds): output could not be cleanly aligned with paper tables or baselines.

This taxonomy lets FactReview differentiate “negative evidence” (execution conflict) from “missing evidence” (artifact or execution failure), a nuance absent in traditional reviews.
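
A minimal mapping from round outcome to evidential meaning, following the taxonomy above; the category names and the mapping are our illustration.

def evidence_status(outcome):
    """Map a verification-round outcome to the kind of evidence it provides."""
    if outcome in {"artifact_failure", "execution_failure", "interpretation_failure"}:
        return "missing evidence"    # cannot run or cannot align the output -> Inconclusive
    if outcome == "execution_conflict":
        return "negative evidence"   # ran successfully and contradicts the paper -> In conflict
    return "positive evidence"       # ran successfully and matches the paper -> Supported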

System‑Level Comparison and Positioning

A comparative table (Table 1 in the paper) shows that existing AI reviewers either lack literature retrieval, claim assessment, or execution verification. FactReview uniquely combines all three, plus explicit claim‑evidence linking and the deliberate omission of a final accept/reject recommendation.

Limitations

Evaluation is limited to a single repository (CompGCN); large‑scale multi‑paper studies are absent.

The approach applies only to empirical ML papers that provide runnable code; purely theoretical or dataset‑centric contributions fall outside the current scope.

Bounded repair restricts fixes to environment‑level changes, so minor API deprecations can cause verification failure without reflecting a flaw in the scientific claim.

Claim extraction depends on the underlying LLM’s comprehension, risking missed or mis‑parsed statements.

No user study with human reviewers was conducted, so the impact on real peer‑review decisions remains unmeasured.

References

Xu H, Yue L, Ouyang C, Liu Y, Zheng L, Pan S, Di S, Zhang ML. FactReview: Evidence‑Grounded Reviews with Literature Positioning and Execution‑Based Claim Verification. arXiv preprint arXiv:2604.04074, 2026.

Vashishth S, Sanyal S, Nitin V, Talukdar P. Composition‑based multi‑relational graph convolutional networks. arXiv preprint arXiv:1911.03082, 2019.

Pineau J, Vincent‑Lamarre P, Sinha K, et al. Improving reproducibility in machine learning research (a report from the NeurIPS 2019 reproducibility program). J. Mach. Learn. Res. 2020, 22:164:1–164:20.

Raff E. A step toward quantifying independently reproducible machine learning research. arXiv preprint arXiv:1909.06674, 2019.

D'Arcy M, Hope T, Birnbaum L, Downey D. MARG: Multi‑agent review generation for scientific papers. arXiv preprint arXiv:2401.04259, 2024.

Zhu M, Weng Y, Yang L, Zhang Y. DeepReview: Improving LLM‑based paper review with human‑like deep thinking process. Proc. ACL, 2025, pp. 29330–29355.

Sahu G, Larochelle H, Charlin L, Pal C. ReviewerToo: Should AI join the program committee? arXiv preprint arXiv:2510.08867, 2025.

Idahl M, Ahmadi Z. OpenReviewer: A specialized large language model for generating critical scientific paper reviews. 2024, pp. 550–562.

Li R, Gu JC, Kung PN, et al. LLM‑ReVal: Can we trust LLM reviewers yet? arXiv preprint arXiv:2510.12367, 2025.

Wadden D, Lo K, Wang LL, et al. Fact or fiction: Verifying scientific claims. arXiv preprint arXiv:2004.14974, 2020.

Lu C, Lu C, Lange RT, Foerster J, Clune J, Ha D. The AI Scientist: Towards fully automated open‑ended scientific discovery. arXiv preprint arXiv:2408.06292, 2024.
