Can Multi‑Agent AI Generate Conference‑Ready Papers? Inside PaperOrchestra
PaperOrchestra, a multi‑agent collaborative framework, transforms unstructured research notes into LaTeX‑formatted conference papers by automating literature review, chart generation, and drafting. In human evaluations on CVPR and ICLR benchmarks, it achieves absolute win rates of 50–68% over baseline systems on literature‑review quality.
Overview
PaperOrchestra is a multi‑agent collaboration system designed to turn unstructured research material—such as informal notes and experiment logs—into LaTeX‑ready conference papers that meet the formatting requirements of top venues like CVPR and ICLR. The framework automatically produces deep literature reviews and visualizations, as illustrated by the generated ICLR paper example.
Core Method: Five‑Agent Collaboration
The system decomposes the writing process into five specialized agents that operate in a mix of sequential and parallel stages:
Agent 1 – Input Parsing: Converts free‑form notes into a structured internal representation.
Agent 2 – Literature Search: Concurrently discovers candidate papers via the Semantic Scholar API.
Agent 3 – Chart Generation: Produces figures and tables from raw experimental data.
Agent 4 – Draft Synthesis: Assembles the literature review, methodology, results, and visualizations into a coherent LaTeX draft.
Agent 5 – Iterative Refinement: Performs multiple optimization passes, respecting strict conference submission cut‑off dates (CVPR Nov 2024, ICLR Oct 2024) to avoid “future‑citation” leakage.
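The five‑agent flow above can be sketched as a small orchestration loop. This is a minimal illustration, not the paper's implementation: all function names and bodies are hypothetical placeholders, and the only structural claims taken from the source are the stage ordering, the parallel literature/chart stages, and the temporal cutoff filter against future‑citation leakage.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date

def parse_notes(raw_notes: str) -> dict:
    """Agent 1 (sketch): turn free-form notes into a structured representation."""
    return {"topic": raw_notes.split("\n")[0], "body": raw_notes}

def search_literature(structured: dict, cutoff: date) -> list[dict]:
    """Agent 2 (sketch): gather candidate papers, respecting the submission cutoff."""
    # Stand-in for a real retrieval call (e.g. a paper-search API).
    candidates = [
        {"title": "Prior work A", "published": date(2024, 5, 1)},
        {"title": "Too-new work B", "published": date(2025, 1, 15)},
    ]
    # Enforce the temporal cutoff to avoid "future-citation" leakage.
    return [p for p in candidates if p["published"] <= cutoff]

def make_charts(structured: dict) -> list[str]:
    """Agent 3 (sketch): figures and tables from raw experimental data."""
    return ["fig1.pdf"]

def synthesize_draft(structured: dict, papers: list[dict], charts: list[str]) -> str:
    """Agent 4 (sketch): assemble the pieces into a LaTeX draft."""
    refs = "".join(f"% cite: {p['title']}\n" for p in papers)
    return f"\\documentclass{{article}}\n{refs}\\begin{{document}}...\\end{{document}}"

def refine(draft: str, passes: int = 3) -> str:
    """Agent 5 (sketch): iterative optimization passes over the draft."""
    for _ in range(passes):
        pass  # a real system would revise the draft on each pass
    return draft

def run_pipeline(raw_notes: str, cutoff: date) -> str:
    structured = parse_notes(raw_notes)
    # Literature search and chart generation are independent, so run them in parallel.
    with ThreadPoolExecutor() as pool:
        papers_future = pool.submit(search_literature, structured, cutoff)
        charts_future = pool.submit(make_charts, structured)
        papers, charts = papers_future.result(), charts_future.result()
    return refine(synthesize_draft(structured, papers, charts))

draft = run_pipeline("Topic: multi-agent paper writing\n...", cutoff=date(2024, 10, 1))
```

Note how the cutoff filter drops the 2025 candidate paper even though it was retrieved, mirroring the ICLR Oct 2024 cutoff described above.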
Benchmark: PaperWritingBench
Google introduced PaperWritingBench as the first standardized benchmark for AI‑assisted paper writing. It includes two tracks—CVPR 2025 and ICLR 2025—each containing 100 papers with detailed statistics:
Average citation count: ~58 ± 18
Mandatory citations (P0): ~14 ± 7
Suggested citations (P1): ~44 ± 15
Figures per paper: 5.2 (CVPR) vs 9.2 (ICLR)
Tables per paper: 4.2 (CVPR) vs 8.1 (ICLR)
Experiment‑log length: 1,530 words (CVPR) vs 2,387 words (ICLR)
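One way to picture a benchmark entry is as a record carrying the fields the statistics above are computed over. The layout and field names below are illustrative assumptions, not the benchmark's actual schema; the example values are chosen to match the reported averages.

```python
from dataclasses import dataclass

@dataclass
class BenchEntry:
    """One PaperWritingBench paper (hypothetical schema)."""
    track: str                # "CVPR2025" or "ICLR2025"
    p0_citations: list[str]   # mandatory references
    p1_citations: list[str]   # suggested references
    n_figures: int
    n_tables: int
    experiment_log: str       # raw notes fed to the writing system

    @property
    def total_citations(self) -> int:
        return len(self.p0_citations) + len(self.p1_citations)

entry = BenchEntry(
    track="ICLR2025",
    p0_citations=[f"ref{i}" for i in range(14)],      # ~14 mandatory (P0)
    p1_citations=[f"ref{i}" for i in range(14, 58)],  # ~44 suggested (P1)
    n_figures=9,
    n_tables=8,
    experiment_log="(roughly 2,400 words of notes)",
)
print(entry.total_citations)  # 58, matching the ~58 average citation count
```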
Human Evaluation: Side‑by‑Side Comparison
Eleven AI researchers evaluated 40 papers (180 blind pairwise comparisons). PaperOrchestra outperformed a single‑agent baseline and the AI‑Scientist‑v2 system:
Literature‑review quality: +67.6% vs single agent, +50.0% vs AI‑Scientist‑v2.
Overall paper quality: +37.8% vs single agent, +13.9% vs AI‑Scientist‑v2.
Against human‑written ground‑truth papers, however, PaperOrchestra still loses in overall quality by a 37.8% margin.
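For context, the win rates above come from blind pairwise judgments. A minimal sketch of how such percentages are computed from raw judgments (the paper may aggregate differently, e.g. per‑rater or with tie handling):

```python
from collections import Counter

def win_rates(judgments: list[str]) -> dict[str, float]:
    """Percentage of blind pairwise comparisons won by each side.

    Each judgment is "A", "B", or "tie"; percentages are over all comparisons.
    """
    counts = Counter(judgments)
    total = len(judgments)
    return {outcome: 100.0 * counts[outcome] / total for outcome in ("A", "B", "tie")}

# Toy data: 20 comparisons of system A vs. system B.
judgments = ["A"] * 13 + ["B"] * 5 + ["tie"] * 2
rates = win_rates(judgments)
print(rates)  # {'A': 65.0, 'B': 25.0, 'tie': 10.0}
```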
Automatic Evaluation: Citation F1 and Paper Quality
Using Gemini‑3.1‑Pro and GPT‑5 as judges, PaperOrchestra achieved higher scores than baselines:
Average citations: 47.98 (PaperOrchestra) vs 14.18 (AI‑Scientist‑v2) vs 11.46 (Single Agent).
P0 recall: 63.58% vs 37.46% vs 28.16%.
P1 recall: 15.85% vs 3.30% vs 3.27%.
Overall F1: 29.65% vs 17.26% vs 11.46%.
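The citation metrics above can be made concrete with a small sketch. The exact definitions used in the paper (e.g. how precision is computed, or whether P0 and P1 are weighted) are not given here, so treat this as one plausible reading:

```python
def citation_metrics(generated: set[str], p0: set[str], p1: set[str]) -> dict[str, float]:
    """P0/P1 recall and an overall citation F1 over the union of gold references.

    Sketch only; the benchmark's exact metric definitions may differ.
    """
    gold = p0 | p1
    true_pos = len(generated & gold)
    precision = true_pos / len(generated) if generated else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "p0_recall": len(generated & p0) / len(p0) if p0 else 0.0,
        "p1_recall": len(generated & p1) / len(p1) if p1 else 0.0,
        "f1": f1,
    }

# Toy example: 10 mandatory and 40 suggested references; the system cites
# 7 mandatory, 8 suggested, and 2 spurious papers.
p0 = {f"p0_{i}" for i in range(10)}
p1 = {f"p1_{i}" for i in range(40)}
generated = {f"p0_{i}" for i in range(7)} | {f"p1_{i}" for i in range(8)} | {"x1", "x2"}
m = citation_metrics(generated, p0, p1)
print(m)  # p0_recall 0.70, p1_recall 0.20, f1 ≈ 0.45
```

The toy numbers echo the pattern in the table: P0 recall can be high while P1 recall stays low when a system cites few papers overall.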
Technical Quality Assessment: Simulated Peer Review
Two automated review frameworks were used:
AI‑Scientist‑v2 Reviewer (average score / acceptance rate): CVPR – 5.12 / 48% (PaperOrchestra) vs 4.22 / 22% (AI‑Scientist‑v2) vs 4.60 / 33% (single agent); ICLR – 4.10 / 22% vs 3.42 / 11% vs 3.22 / 4%.
ScholarPeer (average score / acceptance rate): CVPR – 6.93 / 84% (PaperOrchestra) vs 6.30 / 70% (AI‑Scientist‑v2) vs 6.35 / 71% (single agent); ICLR – 7.03 / 81% vs 6.04 / 64% vs 6.35 / 72%.
Human ground‑truth papers received 5.95 / 71% (CVPR) and 5.81 / 63% (ICLR); under ScholarPeer, PaperOrchestra thus scores at or above the human‑paper level.
Key Insights
Baseline systems cite only 9‑14 papers, resulting in near‑zero P1 recall, indicating shallow exploration of the scholarly landscape.
PaperOrchestra generates 45‑48 citations per paper, approaching the human average of ~59, demonstrating broader academic coverage.
Despite strong performance, PaperOrchestra still trails humans in scientific depth and evidence presentation.
Conclusion
PaperOrchestra proves that a multi‑agent AI can produce near‑human quality conference papers, achieving substantial gains over existing AI writing baselines across human and automated evaluations. The system’s ability to decouple tasks, run in parallel, and enforce strict temporal cut‑offs makes it a promising step toward AI‑driven scientific discovery, though further improvements are needed to fully match human expertise.
https://arxiv.org/pdf/2604.05018
https://yiwen-song.github.io/paper_orchestra/
PaperOrchestra: A Multi‑Agent Framework for Automated AI Research Paper Writing