From Reasoning to Physical Execution: Peking University Papers Push LLMs Toward Fully Automated Labs
Two Peking University papers, presented at ICML 2026 and ACL 2026, introduce BioProBench and BioProAgent. BioProBench benchmarks how well large language models reason about biological protocols, while BioProAgent lets them execute complex wet‑lab experiments safely, reaching 95.6 % physical compliance. Both are integrated into the multi‑agent AI4S LAB platform.
BioProBench – Benchmark for Biological Protocol Reasoning
BioProBench is built on the BioProCorpus, a collection of 27,000 human‑written biological protocols. From this corpus the authors generated more than 550,000 task instances covering three evaluation dimensions:
ERR – error‑correction tasks that test the ability to detect and fix protocol mistakes.
REA‑ERR – reasoning‑augmented error‑correction tasks that require quantitative and logical inference.
Safety awareness – tasks that assess whether a model avoids unsafe actions.
The benchmark was used to evaluate ten mainstream large language models (LLMs). All models achieved high scores on basic comprehension, but their performance dropped sharply on REA‑ERR and safety tasks, confirming the authors’ claim that current LLMs struggle with deep reasoning, quantitative precision, and safety awareness in biological protocols. Detailed results are shown in Tables 2 and 3 of the paper (available at https://arxiv.org/pdf/2505.07889).
Source code and data are released at https://github.com/YuyangSunshine/bioprotocolbench.
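A minimal sketch of how per-dimension scores like those described above could be tallied. The tuple layout, the dimension labels, and the exact-match rule are assumptions for illustration; they are not the BioProBench schema or its official metrics, and the rows are toy data, not benchmark results.

```python
from collections import defaultdict

# Toy rows: (dimension, model_answer, reference_answer).
# Layout and values are hypothetical, not taken from the benchmark.
results = [
    ("ERR", "step 3", "step 3"),
    ("ERR", "step 1", "step 4"),
    ("REA-ERR", "50 uL", "500 uL"),
    ("REA-ERR", "37 C", "37 C"),
    ("SAFETY", "refuse", "refuse"),
]

def accuracy_by_dimension(rows):
    """Return exact-match accuracy for each evaluation dimension."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for dimension, predicted, reference in rows:
        total[dimension] += 1
        if predicted.strip().lower() == reference.strip().lower():
            correct[dimension] += 1
    return {dim: correct[dim] / total[dim] for dim in total}

scores = accuracy_by_dimension(results)
for dim, acc in scores.items():
    print(f"{dim}: {acc:.2f}")
```

Reporting each dimension separately, rather than one pooled number, is what makes a drop on reasoning or safety tasks visible next to strong comprehension scores.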
BioProAgent – Neural‑Symbolic Execution Framework
BioProAgent addresses the execution gap between probabilistic LLM outputs and irreversible wet‑lab operations. Its architecture combines:
A probabilistic planner that proposes actions.
A deterministic finite‑state machine (FSM) that anchors the plan, guaranteeing that every transition complies with hardware constraints.
A safety‑enhanced planning layer that enforces a strict "design → verify → correct" workflow before any physical command is issued.
Semantic Symbol Grounding, which abstracts complex device contexts into symbolic tokens and cuts token consumption roughly sixfold.
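The interplay between a probabilistic planner and a deterministic FSM can be sketched as follows. This is an illustrative toy, not the paper's implementation: the device states, the action names, and the `correct` strategy (dropping the offending action) are all assumptions, and a real planner would ask the LLM to repair the plan instead.

```python
# Hypothetical device states and the actions each one permits.
ALLOWED = {
    "idle": {"load_plate"},
    "plate_loaded": {"aspirate", "unload_plate"},
    "aspirated": {"dispense"},
}
# Hypothetical deterministic transition table.
NEXT_STATE = {
    ("idle", "load_plate"): "plate_loaded",
    ("plate_loaded", "aspirate"): "aspirated",
    ("plate_loaded", "unload_plate"): "idle",
    ("aspirated", "dispense"): "plate_loaded",
}

def verify(plan, state="idle"):
    """Return the index of the first illegal action, or None if the plan is valid."""
    for i, action in enumerate(plan):
        if action not in ALLOWED.get(state, set()):
            return i
        state = NEXT_STATE[(state, action)]
    return None

def correct(plan, bad_index):
    """Stand-in for the planner's repair step: drop the offending action."""
    return plan[:bad_index] + plan[bad_index + 1:]

def design_verify_correct(draft, max_rounds=5):
    """Repeat verify and correct until the FSM accepts the plan."""
    plan = list(draft)
    for _ in range(max_rounds):
        bad = verify(plan)
        if bad is None:
            return plan
        plan = correct(plan, bad)
    raise RuntimeError("plan could not be made compliant")

# A draft that tries to dispense before aspirating.
draft = ["load_plate", "dispense", "aspirate", "dispense", "unload_plate"]
safe_plan = design_verify_correct(draft)
print(safe_plan)
```

The point of the design is that no command leaves the loop unless the transition table accepts it, so compliance does not depend on the planner being right on the first attempt.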
In benchmark evaluation, BioProAgent achieved 95.6 % physical compliance, whereas the ReAct baseline reached only 21.0 %, which the authors attribute to the neural‑symbolic constraints that keep autonomous execution reliable. The paper describing BioProAgent is available at https://arxiv.org/pdf/2603.00876.
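The idea behind Semantic Symbol Grounding, replacing verbose device context with short symbols that are expanded only when needed, might look roughly like this. The device descriptions, the `@SYMBOL` notation, and the function names are hypothetical, and word counts are used only as a crude stand-in for token counts.

```python
# Hypothetical verbose device contexts keyed by short symbols.
DEVICE_CONTEXT = {
    "PIP1": "eight-channel pipetting arm, volume range 1 to 200 microliters, "
            "deck slot 3, tip rack attached",
    "INC1": "shaking incubator, temperature range 4 to 60 degrees Celsius, "
            "capacity four plates, door closed",
}

def ground(symbol):
    """Expand a symbolic token back into its full device context."""
    return DEVICE_CONTEXT[symbol]

def prompt_with_symbols(steps):
    """Build a compact prompt that refers to devices only by symbol."""
    return "; ".join(f"{verb} @{symbol}" for verb, symbol in steps)

def prompt_with_full_context(steps):
    """Build the verbose prompt that inlines every device description."""
    return "; ".join(f"{verb} using {ground(symbol)}" for verb, symbol in steps)

steps = [("aspirate", "PIP1"), ("incubate", "INC1")]
compact = prompt_with_symbols(steps)
verbose = prompt_with_full_context(steps)
print(compact)
print(len(verbose.split()), "words vs", len(compact.split()), "words")
```

The saving in this toy depends entirely on how long the made-up descriptions are, so it says nothing about the sixfold figure reported in the paper.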
BIOMA – Native Multi‑Agent System for AI4S LAB
BIOMA integrates BioProBench and BioProAgent into a closed‑loop, wet‑dry research platform (AI4S LAB, https://ai4slab.pkusz.edu.cn/). The system comprises four specialized agents:
PredAgent – generates theoretical hypotheses.
ProAgent – designs experimental plans.
OperAgent – translates abstract protocols into executable command sequences for an automated hardware suite containing more than 22 devices.
ComAgent – performs data analysis on experimental results.
OperAgent’s compilation step bridges the digital protocol and the physical hardware, enabling a seamless wet‑dry closed‑loop workflow. The authors claim this constitutes the first AI‑driven, end‑to‑end cloud‑based research ecosystem that offloads low‑level execution to a reliable autonomous backbone.
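The hand-off between the four agents can be pictured as a simple pipeline over a shared record. This is a sketch under stated assumptions: the `RunRecord` fields, the placeholder values, and the step-to-device mapping are invented for illustration, and the real agents are LLM-driven rather than fixed functions.

```python
from dataclasses import dataclass, field

@dataclass
class RunRecord:
    """Carries artifacts between agents in one closed-loop iteration."""
    hypothesis: str = ""
    protocol: list = field(default_factory=list)
    commands: list = field(default_factory=list)
    analysis: str = ""

def pred_agent(record):
    record.hypothesis = "hypothesis-001"  # placeholder hypothesis
    return record

def pro_agent(record):
    record.protocol = ["mix reagents", "incubate 30 min"]  # placeholder plan
    return record

def oper_agent(record):
    # Compile each abstract step into a (device, command) pair.
    # The verb-to-device mapping is hypothetical.
    device_for = {"mix": "PIPETTE", "incubate": "INCUBATOR"}
    record.commands = [
        (device_for[step.split()[0]], step) for step in record.protocol
    ]
    return record

def com_agent(record):
    record.analysis = f"{len(record.commands)} commands compiled; results pending"
    return record

PIPELINE = [pred_agent, pro_agent, oper_agent, com_agent]

def run_once():
    """Pass one record through every agent in order."""
    record = RunRecord()
    for agent in PIPELINE:
        record = agent(record)
    return record

record = run_once()
print(record.commands)
```

In a closed loop, the analysis produced by the last stage would feed the next hypothesis; the sketch stops after a single pass to keep the data flow visible.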
Source: ScienceAI
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
