Can Large Language Models Design Chemical Synthesis? ChemReason‑Bench Exposes AI’s Logic Gaps

The ChemReason‑Bench benchmark, introduced by Shanghai Jiao Tong University, evaluates large language models on six program‑reasoning tasks for chemical synthesis, revealing that while top general models show modest reasoning ability, step‑completion remains difficult and domain‑specific models lag behind, prompting new training datasets for improvement.

Data Party THU
Data Party THU
Data Party THU
Can Large Language Models Design Chemical Synthesis? ChemReason‑Bench Exposes AI’s Logic Gaps

Background

Automation of organic synthesis, material development, and drug screening is limited more by the inability of large language models (LLMs) to reason about experimental protocols than by robotic precision. Existing chemistry LLM evaluations focus on knowledge‑question answering, which does not test the cross‑step constraints required for executable procedures.

Benchmark Design – ChemReason‑Bench

ChemReason‑Bench is a large‑scale, human‑validated benchmark for experimental procedure reasoning. It is built on 500 organic reactions and contains 7,306 manually verified instances. Each instance is framed in a structured template with explicit placeholders, enabling automatic verification of operational constraints.

The benchmark decomposes experimental program reasoning into six complementary abilities:

Step ordering – select the correct sequence among candidate operations.

Step validation – decide whether a candidate action is feasible in the current experimental context.

Condition validation – assess the reasonableness of temperature, duration, etc.

Step completion – generate a missing operation that satisfies surrounding constraints.

Contrast selection – identify the correct role of a substance among confusing alternatives.

Principle explanation – provide the causal logic behind an operation or condition.

This multi‑task design allows researchers to plot ability‑radar charts for each model, revealing strengths and weaknesses across dimensions.

Evaluation Results

Eighteen models—including open‑source, closed‑source, and chemistry‑specialized systems such as GPT‑5.2, DeepSeek‑v3.2, Llama‑3.1, ChemLLM, ChemDFM, and LlaSMol—were evaluated uniformly on ChemReason‑Bench. Key findings:

Top general‑purpose models exhibit emerging program‑reasoning capability. GPT‑5.2 achieved an overall score of 70.30; DeepSeek‑v3.2 scored 65.21.

Step completion is the hardest task. Even the best model (GPT‑5.2) reached only 51.65, indicating difficulty in generating fully correct structured steps under strict constraints.

Chemistry‑specific models underperformed relative to general models, showing that exposure to chemical corpora alone does not confer procedural reasoning ability.

Some models displayed inconsistency between free‑text generation and discrete decision outputs, revealing instability in their decision processes.

Training Set and Fine‑tuning – ChemReason‑TUNE

The authors released ChemReason‑TUNE, a training set comprising more than 120 000 task instances derived from the benchmark schema. Fine‑tuning a 2‑9 B‑parameter model (Gemma‑2‑9B) on ChemReason‑TUNE yielded performance comparable to leading closed‑source systems, demonstrating a viable path toward lightweight, locally deployable AI assistants for laboratory use.

Resources

Benchmark code, data, and detailed documentation are openly available at https://openreview.net/forum?id=aVXpKdGUFx and https://github.com/Khadaz/ChemReason-Bench.

Code example

来源:ScienceAI
本文
约1000字
,建议阅读
5
分钟
即便最强大的AI,距离可靠的化学家仍有不小的距离。
Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

large language modelsbenchmarkAI chemistrychemical synthesisChemReason-Benchprogram reasoning
Data Party THU
Written by

Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.