Can Multi‑Agent AI Generate Conference‑Ready Papers? Inside PaperOrchestra
PaperOrchestra, a multi‑agent collaborative framework, transforms unstructured research notes into LaTeX‑formatted conference papers by automating literature review, chart generation, and drafting. In human evaluations on CVPR and ICLR benchmarks, it achieves absolute win rates of 50–68% over baseline systems on literature‑review quality.
Overview
PaperOrchestra is a multi‑agent collaboration system designed to turn unstructured research material—such as informal notes and experiment logs—into LaTeX‑ready conference papers that meet the formatting requirements of top venues like CVPR and ICLR. The framework automatically produces deep literature reviews and visualizations, as illustrated by the generated ICLR paper example.
Core Method: Five‑Agent Collaboration
The system decomposes the writing process into five specialized agents that operate in a mix of sequential and parallel stages:
Agent 1 – Input Parsing: Converts free‑form notes into a structured internal representation.
Agent 2 – Literature Search: Concurrently discovers candidate papers via the Semantic Scholar API.
Agent 3 – Chart Generation: Produces figures and tables from raw experimental data.
Agent 4 – Draft Synthesis: Assembles the literature review, methodology, results, and visualizations into a coherent LaTeX draft.
Agent 5 – Iterative Refinement: Performs multiple optimization passes, respecting strict conference submission cut‑off dates (CVPR Nov 2024, ICLR Oct 2024) to avoid “future‑citation” leakage.
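The five‑agent flow above can be sketched as a small orchestration loop. This is a minimal illustration, not the paper's implementation: all function names and bodies are hypothetical placeholders, and the only structural claims taken from the source are the stage ordering, the parallel literature/chart stages, and the temporal cutoff filter against future‑citation leakage.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date

def parse_notes(raw_notes: str) -> dict:
    """Agent 1 (sketch): turn free-form notes into a structured representation."""
    return {"topic": raw_notes.split("\n")[0], "body": raw_notes}

def search_literature(structured: dict, cutoff: date) -> list[dict]:
    """Agent 2 (sketch): gather candidate papers, respecting the submission cutoff."""
    # Stand-in for a real retrieval call (e.g. a paper-search API).
    candidates = [
        {"title": "Prior work A", "published": date(2024, 5, 1)},
        {"title": "Too-new work B", "published": date(2025, 1, 15)},
    ]
    # Enforce the temporal cutoff to avoid "future-citation" leakage.
    return [p for p in candidates if p["published"] <= cutoff]

def make_charts(structured: dict) -> list[str]:
    """Agent 3 (sketch): figures and tables from raw experimental data."""
    return ["fig1.pdf"]

def synthesize_draft(structured: dict, papers: list[dict], charts: list[str]) -> str:
    """Agent 4 (sketch): assemble the pieces into a LaTeX draft."""
    refs = "".join(f"% cite: {p['title']}\n" for p in papers)
    return f"\\documentclass{{article}}\n{refs}\\begin{{document}}...\\end{{document}}"

def refine(draft: str, passes: int = 3) -> str:
    """Agent 5 (sketch): iterative optimization passes over the draft."""
    for _ in range(passes):
        pass  # a real system would revise the draft on each pass
    return draft

def run_pipeline(raw_notes: str, cutoff: date) -> str:
    structured = parse_notes(raw_notes)
    # Literature search and chart generation are independent, so run them in parallel.
    with ThreadPoolExecutor() as pool:
        papers_future = pool.submit(search_literature, structured, cutoff)
        charts_future = pool.submit(make_charts, structured)
        papers, charts = papers_future.result(), charts_future.result()
    return refine(synthesize_draft(structured, papers, charts))

draft = run_pipeline("Topic: multi-agent paper writing\n...", cutoff=date(2024, 10, 1))
```

Note how the cutoff filter drops the 2025 candidate paper even though it was retrieved, mirroring the ICLR Oct 2024 cutoff described above.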
Benchmark: PaperWritingBench
Google introduced PaperWritingBench as the first standardized benchmark for AI‑assisted paper writing. It includes two tracks—CVPR 2025 and ICLR 2025—each containing 100 papers with detailed statistics:
Average citation count: ~58 ± 18
Mandatory citations (P0): ~14 ± 7
Suggested citations (P1): ~44 ± 15
Figures per paper: 5.2 (CVPR) vs 9.2 (ICLR)
Tables per paper: 4.2 (CVPR) vs 8.1 (ICLR)
Experiment‑log length: 1,530 words (CVPR) vs 2,387 words (ICLR)
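One way to picture a benchmark entry is as a record carrying the fields the statistics above are computed over. The layout and field names below are illustrative assumptions, not the benchmark's actual schema; the example values are chosen to match the reported averages.

```python
from dataclasses import dataclass

@dataclass
class BenchEntry:
    """One PaperWritingBench paper (hypothetical schema)."""
    track: str                # "CVPR2025" or "ICLR2025"
    p0_citations: list[str]   # mandatory references
    p1_citations: list[str]   # suggested references
    n_figures: int
    n_tables: int
    experiment_log: str       # raw notes fed to the writing system

    @property
    def total_citations(self) -> int:
        return len(self.p0_citations) + len(self.p1_citations)

entry = BenchEntry(
    track="ICLR2025",
    p0_citations=[f"ref{i}" for i in range(14)],      # ~14 mandatory (P0)
    p1_citations=[f"ref{i}" for i in range(14, 58)],  # ~44 suggested (P1)
    n_figures=9,
    n_tables=8,
    experiment_log="(roughly 2,400 words of notes)",
)
print(entry.total_citations)  # 58, matching the ~58 average citation count
```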
Human Evaluation: Side‑by‑Side Comparison
Eleven AI researchers evaluated 40 papers (180 blind pairwise comparisons). PaperOrchestra outperformed a single‑agent baseline and the AI‑Scientist‑v2 system:
Literature‑review quality: +67.6% vs single agent, +50.0% vs AI‑Scientist‑v2.
Overall paper quality: +37.8% vs single agent, +13.9% vs AI‑Scientist‑v2.
Against human‑written ground‑truth papers, however, PaperOrchestra still loses in overall quality by a 37.8% margin.
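For context, the win rates above come from blind pairwise judgments. A minimal sketch of how such percentages are computed from raw judgments (the paper may aggregate differently, e.g. per‑rater or with tie handling):

```python
from collections import Counter

def win_rates(judgments: list[str]) -> dict[str, float]:
    """Percentage of blind pairwise comparisons won by each side.

    Each judgment is "A", "B", or "tie"; percentages are over all comparisons.
    """
    counts = Counter(judgments)
    total = len(judgments)
    return {outcome: 100.0 * counts[outcome] / total for outcome in ("A", "B", "tie")}

# Toy data: 20 comparisons of system A vs. system B.
judgments = ["A"] * 13 + ["B"] * 5 + ["tie"] * 2
rates = win_rates(judgments)
print(rates)  # {'A': 65.0, 'B': 25.0, 'tie': 10.0}
```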
Automatic Evaluation: Citation F1 and Paper Quality
Using Gemini‑3.1‑Pro and GPT‑5 as judges, PaperOrchestra achieved higher scores than baselines:
Average citations: 47.98 (PaperOrchestra) vs 14.18 (AI‑Scientist‑v2) vs 11.46 (Single Agent).
P0 recall: 63.58% vs 37.46% vs 28.16%.
P1 recall: 15.85% vs 3.30% vs 3.27%.
Overall F1: 29.65% vs 17.26% vs 11.46%.
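The citation metrics above can be made concrete with a small sketch. The exact definitions used in the paper (e.g. how precision is computed, or whether P0 and P1 are weighted) are not given here, so treat this as one plausible reading:

```python
def citation_metrics(generated: set[str], p0: set[str], p1: set[str]) -> dict[str, float]:
    """P0/P1 recall and an overall citation F1 over the union of gold references.

    Sketch only; the benchmark's exact metric definitions may differ.
    """
    gold = p0 | p1
    true_pos = len(generated & gold)
    precision = true_pos / len(generated) if generated else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "p0_recall": len(generated & p0) / len(p0) if p0 else 0.0,
        "p1_recall": len(generated & p1) / len(p1) if p1 else 0.0,
        "f1": f1,
    }

# Toy example: 10 mandatory and 40 suggested references; the system cites
# 7 mandatory, 8 suggested, and 2 spurious papers.
p0 = {f"p0_{i}" for i in range(10)}
p1 = {f"p1_{i}" for i in range(40)}
generated = {f"p0_{i}" for i in range(7)} | {f"p1_{i}" for i in range(8)} | {"x1", "x2"}
m = citation_metrics(generated, p0, p1)
print(m)  # p0_recall 0.70, p1_recall 0.20, f1 ≈ 0.45
```

The toy numbers echo the pattern in the table: P0 recall can be high while P1 recall stays low when a system cites few papers overall.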
Technical Quality Assessment: Simulated Peer Review
Two automated review frameworks were used:
AI‑Scientist‑v2 Reviewer (average score / acceptance rate): CVPR – 5.12 / 48% (PaperOrchestra) vs 4.22 / 22% (AI‑Scientist‑v2) vs 4.60 / 33% (single agent); ICLR – 4.10 / 22% vs 3.42 / 11% vs 3.22 / 4%.
ScholarPeer (average score / acceptance rate): CVPR – 6.93 / 84% (PaperOrchestra) vs 6.30 / 70% (AI‑Scientist‑v2) vs 6.35 / 71% (single agent); ICLR – 7.03 / 81% vs 6.04 / 64% vs 6.35 / 72%.
Human ground‑truth papers received 5.95 / 71% (CVPR) and 5.81 / 63% (ICLR); under ScholarPeer, PaperOrchestra thus scores at or above the human‑paper level.
Key Insights
Baseline systems cite only 9‑14 papers, resulting in near‑zero P1 recall, indicating shallow exploration of the scholarly landscape.
PaperOrchestra generates 45‑48 citations per paper, approaching the human average of ~59, demonstrating broader academic coverage.
Despite strong performance, PaperOrchestra still trails humans in scientific depth and evidence presentation.
Conclusion
PaperOrchestra proves that a multi‑agent AI can produce near‑human quality conference papers, achieving substantial gains over existing AI writing baselines across human and automated evaluations. The system’s ability to decouple tasks, run in parallel, and enforce strict temporal cut‑offs makes it a promising step toward AI‑driven scientific discovery, though further improvements are needed to fully match human expertise.
https://arxiv.org/pdf/2604.05018
https://yiwen-song.github.io/paper_orchestra/
PaperOrchestra: A Multi‑Agent Framework for Automated AI Research Paper Writing