
From Reasoning to Physical Execution: Peking University Papers Push LLMs Toward Fully Automated Labs

The article analyzes how two Peking University papers presented at ICML 2026 and ACL 2026 introduce BioProBench and BioProAgent to benchmark and enable large language models to safely perform complex wet‑lab experiments, achieving high physical compliance and integrating into a multi‑agent AI4S LAB platform.

Data Party THU

Jun 22, 2026


BioProBench – Benchmark for Biological Protocol Reasoning

BioProBench is built on the BioProCorpus, a collection of 27,000 human‑written biological protocols. From this corpus the authors generated more than 550,000 task instances covering three evaluation dimensions:

ERR – error‑correction tasks that test the ability to detect and fix protocol mistakes.

REA‑ERR – reasoning‑augmented error‑correction tasks that require quantitative and logical inference.

Safety‑awareness tasks that assess whether a model avoids unsafe actions.

The authors used the benchmark to evaluate ten mainstream large language models (LLMs). All models scored well on basic comprehension, but performance dropped sharply on REA‑ERR and safety tasks. This is consistent with the authors’ claim that current LLMs struggle with deep reasoning, quantitative precision, and safety awareness in biological protocols. Detailed results are shown in Tables 2 and 3 of the paper (available at https://arxiv.org/pdf/2505.07889).
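To make the error‑correction setting concrete, here is a minimal sketch of what one ERR‑style task instance and an accuracy scorer might look like. The field names, the example steps, and the scoring rule are illustrative assumptions, not the benchmark's actual schema or metric.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ErrTask:
    """Hypothetical ERR-style instance: one protocol step that may hold an injected error."""
    step: str                       # the step shown to the model
    has_error: bool                 # gold label: is the step wrong?
    corrected_step: Optional[str]   # gold fix, present only when has_error is True


def accuracy(tasks: List[ErrTask], predictions: List[bool]) -> float:
    """Fraction of tasks where the predicted error flag matches the gold label."""
    if len(tasks) != len(predictions):
        raise ValueError("one prediction is required per task")
    if not tasks:
        raise ValueError("cannot score an empty task list")
    hits = sum(task.has_error == pred for task, pred in zip(tasks, predictions))
    return hits / len(tasks)


# Made-up steps for illustration only; they are not taken from BioProCorpus.
EXAMPLE_TASKS = [
    ErrTask("Centrifuge at 120,000 x g for 10 min.", True,
            "Centrifuge at 12,000 x g for 10 min."),
    ErrTask("Incubate on ice for 5 min.", False, None),
]
```

A scorer like this only checks detection. Judging whether a proposed fix matches `corrected_step` would need a separate comparison, which this sketch leaves out.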

Source code and data are released at https://github.com/YuyangSunshine/bioprotocolbench.

BioProAgent – Neural‑Symbolic Execution Framework

BioProAgent addresses the execution gap between probabilistic LLM outputs and irreversible wet‑lab operations. Its architecture combines:

A probabilistic planner that proposes actions.

A deterministic finite‑state machine (FSM) that anchors the plan, guaranteeing that every transition complies with hardware constraints.

A safety‑enhanced planning layer that enforces a strict "design → verify → correct" workflow before any physical command is issued.

Semantic Symbol Grounding, which abstracts complex device contexts into symbolic tokens, reducing token consumption by roughly a factor of six.
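The interplay between the planner, the FSM, and the "design → verify → correct" workflow can be sketched as follows. This is a toy illustration under stated assumptions: the pipetting states, the action names, and the `propose`/`repair` callables (stand-ins for LLM calls) are hypothetical and are not taken from the paper.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    TIP_LOADED = auto()
    LIQUID_HELD = auto()


# Hypothetical hardware rules for a pipetting arm: (state, action) -> next state.
TRANSITIONS = {
    (State.IDLE, "pick_up_tip"): State.TIP_LOADED,
    (State.TIP_LOADED, "aspirate"): State.LIQUID_HELD,
    (State.LIQUID_HELD, "dispense"): State.TIP_LOADED,
    (State.TIP_LOADED, "drop_tip"): State.IDLE,
}


def verify(plan, start=State.IDLE):
    """Return the index of the first illegal action, or None if the plan is compliant."""
    state = start
    for index, action in enumerate(plan):
        next_state = TRANSITIONS.get((state, action))
        if next_state is None:
            return index
        state = next_state
    return None


def design_verify_correct(propose, repair, max_repairs=3):
    """Design a plan, verify it against the FSM, and correct it until it passes."""
    plan = propose()
    for _ in range(max_repairs):
        bad_index = verify(plan)
        if bad_index is None:
            return plan
        plan = repair(plan, bad_index)
    if verify(plan) is None:
        return plan
    # Nothing is sent to hardware unless the whole plan is compliant.
    raise RuntimeError("no compliant plan found; refusing to execute")


def repair_missing_tip(plan, bad_index):
    """Stand-in for an LLM correction step: insert a tip pickup before the bad action."""
    return plan[:bad_index] + ["pick_up_tip"] + plan[bad_index:]
```

For example, `design_verify_correct(lambda: ["aspirate", "dispense"], repair_missing_tip)` rejects the first draft because aspirating without a tip is not a legal transition, then accepts the repaired plan. The deterministic check is what gates execution, and the probabilistic component only proposes.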

In benchmark evaluation, BioProAgent achieved 95.6% physical compliance, whereas the ReAct baseline reached only 21.0%. The gap illustrates the advantage of neural‑symbolic constraints for reliable autonomous execution. The paper describing BioProAgent is available at https://arxiv.org/pdf/2603.00876.
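One way to picture Semantic Symbol Grounding is as a symbol table: the planner sees only short handles for devices, and the full device context is looked up again when a command is compiled. The sketch below is an assumption about the general idea, not the paper's mechanism, and the device records and field names are invented for illustration.

```python
import json


def ground(devices):
    """Map each verbose device context to a short symbol such as 'D0'.

    Returns the symbol table and a compact summary string for the planner prompt.
    """
    table = {f"D{i}": context for i, context in enumerate(devices)}
    summary = "; ".join(f"{symbol}={context['type']}" for symbol, context in table.items())
    return table, summary


def resolve(command, table):
    """Swap the symbol in a planned command for the full device context."""
    return {**command, "device": table[command["device"]]}


# Invented device records; a real context would carry far more detail.
DEVICES = [
    {"type": "centrifuge", "max_rcf": 20000, "rotor": "fixed-angle", "slots": 24},
    {"type": "pipette", "channels": 8, "min_ul": 1, "max_ul": 200},
]

TABLE, SUMMARY = ground(DEVICES)
FULL_CONTEXT = json.dumps(DEVICES)
```

Here `SUMMARY` is much shorter than `FULL_CONTEXT`, which is the intuition behind the token savings. The actual saving depends on how rich the device contexts are, and this toy example is not meant to reproduce the reported figure.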

BIOMA – Native Multi‑Agent System for AI4S LAB

BIOMA integrates BioProBench and BioProAgent into a closed‑loop, wet‑dry research platform (AI4S LAB, https://ai4slab.pkusz.edu.cn/). The system comprises four specialized agents:

PredAgent – generates theoretical hypotheses.

ProAgent – designs experimental plans.

OperAgent – translates abstract protocols into executable command sequences for an automated hardware suite containing more than 22 devices.

ComAgent – performs data analysis on experimental results.

OperAgent’s compilation step bridges the digital protocol and the physical hardware, enabling a seamless wet‑dry closed‑loop workflow. The authors claim this constitutes the first AI‑driven, end‑to‑end cloud‑based research ecosystem that offloads low‑level execution to a reliable autonomous backbone.
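The closed loop formed by the four agents can be sketched as a single cycle in which each agent's output feeds the next. This is a rough illustration only: the `LoopState` fields, the verb-to-device table, and the callables standing in for PredAgent, ProAgent, the hardware, and ComAgent are all hypothetical, and `compile_protocol` is a simple stand-in for OperAgent's compilation step.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LoopState:
    hypothesis: str = ""
    protocol: List[str] = field(default_factory=list)
    commands: List[dict] = field(default_factory=list)
    findings: List[str] = field(default_factory=list)


# Hypothetical mapping from abstract protocol verbs to (device, operation).
DEVICE_FOR_VERB = {
    "mix": ("shaker", "shake"),
    "incubate": ("incubator", "hold"),
    "measure": ("plate_reader", "read"),
}


def compile_protocol(protocol):
    """OperAgent stand-in: turn abstract steps into device commands."""
    commands = []
    for step in protocol:
        verb = step.split()[0].lower()
        if verb not in DEVICE_FOR_VERB:
            # Fail before execution rather than send an unknown step to hardware.
            raise ValueError(f"no device can execute step: {step!r}")
        device, operation = DEVICE_FOR_VERB[verb]
        commands.append({"device": device, "op": operation, "source_step": step})
    return commands


def run_cycle(state, pred, pro, execute, com):
    """One wet-dry cycle: hypothesize, plan, compile, execute, analyze."""
    state.hypothesis = pred(state.findings)           # PredAgent
    state.protocol = pro(state.hypothesis)            # ProAgent
    state.commands = compile_protocol(state.protocol)  # OperAgent
    raw_results = execute(state.commands)              # automated hardware
    state.findings.append(com(raw_results))            # ComAgent
    return state
```

Calling `run_cycle` repeatedly on the same `LoopState` closes the loop, because the findings from one cycle are passed to the hypothesis step of the next.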

Source: ScienceAI. The original article runs about 2,000 characters, with a suggested reading time of 5 minutes.

The next step in science has already arrived.

