
From 9,874 Papers to 15,000 Structures: MOF‑ChemUnity Rebuilds MOF Knowledge for Explainable AI

MOF‑ChemUnity is a scalable, extensible knowledge graph that uses LLM‑driven entity matching to link MOF names and synonyms from 9,874 papers to over 15,000 crystal structures. It supports accurate, explainable AI‑assisted material discovery, water‑stability prediction, validation of expert recommendations, and graph‑enhanced retrieval.

HyperAI Super Neural

Nov 20, 2025

Introduction

Metal–Organic Frameworks (MOFs) are versatile materials with high surface area, tunable chemistry, and structural diversity, which makes them useful for gas separation, catalysis, and sensing. Over 125,000 MOFs have been synthesized and millions of hypothetical structures predicted, creating a massive, fragmented knowledge space.

Existing AI approaches focus on single‑property extraction from static datasets and struggle with inconsistent naming across literature and databases, hindering robust structure–property linking.

MOF‑ChemUnity Overview

A research team from the University of Toronto and the National Research Council of Canada introduced MOF‑ChemUnity, a structured, scalable, and extensible knowledge graph. By leveraging large language models (LLMs) to map MOF names and their synonyms to Cambridge Structural Database (CSD) entries, the system disambiguates names and establishes reliable one‑to‑one mappings between literature mentions and crystal structures.

In its current version, MOF‑ChemUnity integrates ~10,000 scientific articles and >15,000 CSD crystal structures with computed chemical properties, providing a machine‑operable knowledge source that enhances LLM reasoning.

Dataset: Comprehensive Data View

The knowledge base draws from CoRE MOF 2019 and QMOF, totaling over 31,000 unique crystal structures. Only entries with gas adsorption or band‑structure data and a CSD reference code are retained to ensure traceability.
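The retention rule above can be sketched as a simple filter. This is a minimal illustration, not the authors' pipeline: the `StructureEntry` record, its field names, and the placeholder reference codes are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StructureEntry:
    # Hypothetical record; field names are illustrative and do not come
    # from CoRE MOF 2019 or QMOF.
    name: str
    csd_refcode: Optional[str]
    has_gas_adsorption: bool
    has_band_structure: bool


def is_retained(entry: StructureEntry) -> bool:
    """Keep an entry only if it has a CSD reference code and at least one
    of the two property types (gas adsorption or band structure)."""
    has_property = entry.has_gas_adsorption or entry.has_band_structure
    return bool(entry.csd_refcode) and has_property


# Placeholder entries; the reference codes are made up for the example.
entries = [
    StructureEntry("A", "REFCODE_A", True, False),
    StructureEntry("B", None, True, True),          # no refcode: dropped
    StructureEntry("C", "REFCODE_C", False, False),  # no property data: dropped
    StructureEntry("D", "REFCODE_D", False, True),
]
retained = [e.name for e in entries if is_retained(e)]
print(retained)
```

Requiring the reference code is what keeps every retained structure traceable back to a CSD entry.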

Full‑text articles from ACS, Elsevier, RSC, and other publishers are harvested via text‑and‑data‑mining (TDM), converted to unified Markdown, and processed by the LLM pipeline.

After the matching workflow is applied, 93% of the retained MOF crystal structures (15,143 structures) are linked to 9,874 papers. The workflow also resolves in‑paper aliases such as “Compound 1”.

Experimental properties, synthesis routes, and recommended applications are extracted, yielding >70,000 property records and >2,500 application suggestions.

ChemUnity Knowledge Graph

The graph is designed for three goals: scalability, associability, and queryability. Nodes represent MOFs, publications, synthesis steps, properties, and applications; semantic edges capture relationships. The resulting graph contains >40,000 nodes and 3.2 million edges.
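One way to picture this design is a small typed graph held in memory. The sketch below is an assumption-laden illustration: the class, the node identifiers, and the relation names (`REPORTED_IN`, `HAS_PROPERTY`, `RECOMMENDED_FOR`) are hypothetical and are not taken from the MOF‑ChemUnity schema.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Tiny typed graph: nodes carry a type, edges carry a relation label."""

    def __init__(self):
        self.nodes = {}                 # node_id -> {"type": ..., other attributes}
        self.edges = defaultdict(list)  # node_id -> list of (relation, target_id)

    def add_node(self, node_id, node_type, **attrs):
        self.nodes[node_id] = {"type": node_type, **attrs}

    def add_edge(self, source, relation, target):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        self.edges[source].append((relation, target))

    def neighbors(self, node_id, relation=None):
        """Return targets reachable from node_id, optionally filtered by relation."""
        return [t for r, t in self.edges.get(node_id, [])
                if relation is None or r == relation]


kg = KnowledgeGraph()
kg.add_node("mof:example", "MOF", name="Example-MOF")
kg.add_node("paper:example", "Publication")
kg.add_node("prop:water_stability", "Property", value="unstable")
kg.add_node("app:co2_capture", "Application")

kg.add_edge("mof:example", "REPORTED_IN", "paper:example")
kg.add_edge("mof:example", "HAS_PROPERTY", "prop:water_stability")
kg.add_edge("mof:example", "RECOMMENDED_FOR", "app:co2_capture")

print(kg.neighbors("mof:example", "HAS_PROPERTY"))
```

Storing the relation on each edge is what makes the graph queryable by meaning (for example, "all properties of this MOF") rather than only by connectivity.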

LLM Matching Agent

The first workflow handles named entity recognition, coreference resolution, and unique entity linking. The LLM receives CSD‑derived information (reference code, lattice parameters, metal node, space group, molecular formula, chemical name, synonyms) via the CSD Python API and is instructed to match each paper’s MOF name to the correct CSD entry, then to identify all coreferences to that material in the paper.
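The matching step can be illustrated by how such a prompt might be assembled. This sketch does not call the CSD Python API or any LLM. The `CsdCandidate` class, the prompt wording, and all field values are placeholders assumed for the example, and the authors' actual prompt is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class CsdCandidate:
    """Hypothetical container for the CSD-derived fields listed above."""
    refcode: str
    lattice_parameters: Tuple[float, float, float, float, float, float]
    metal_node: str
    space_group: str
    molecular_formula: str
    chemical_name: str
    synonyms: List[str] = field(default_factory=list)


def format_candidate(c: CsdCandidate) -> str:
    """Render one candidate as a compact text block for the prompt."""
    a, b, c_len, alpha, beta, gamma = c.lattice_parameters
    return (
        f"- refcode: {c.refcode}\n"
        f"  cell: a={a}, b={b}, c={c_len}, alpha={alpha}, beta={beta}, gamma={gamma}\n"
        f"  metal node: {c.metal_node}; space group: {c.space_group}\n"
        f"  formula: {c.molecular_formula}; name: {c.chemical_name}\n"
        f"  synonyms: {', '.join(c.synonyms) or 'none'}"
    )


def build_matching_prompt(paper_text: str, candidates: List[CsdCandidate]) -> str:
    """Assemble an illustrative prompt: candidates first, then the paper text."""
    listing = "\n".join(format_candidate(c) for c in candidates)
    return (
        "Match each MOF name used in the paper to exactly one CSD entry below, "
        "then list every other name the paper uses for the same material.\n\n"
        f"CSD candidates:\n{listing}\n\nPaper text:\n{paper_text}"
    )


# Placeholder values only; none of these describe a real CSD entry.
candidate = CsdCandidate(
    refcode="PLACEHOLDER01",
    lattice_parameters=(10.0, 10.0, 10.0, 90.0, 90.0, 90.0),
    metal_node="Zn",
    space_group="P1",
    molecular_formula="placeholder formula",
    chemical_name="placeholder name",
    synonyms=["Compound 1"],
)
prompt = build_matching_prompt("Compound 1 was synthesized ...", [candidate])
print(prompt)
```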

Information Extraction Workflow

A generic workflow uses the matched MOF names to drive downstream extraction of properties, recommended applications, and synthesis details. For complex properties such as water stability, a Chain of Verification (CoV) method validates extracted values to reduce hallucinations.
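A verification loop of this kind can be sketched as four steps: draft an answer, plan verification questions, answer each question against the source passage alone, and revise. The code below is a generic outline under stated assumptions. The prompt prefixes and the `stub_llm` function are hypothetical stand-ins so the example runs without a model, and the loop is not the authors' implementation.

```python
from typing import Callable, Dict


def chain_of_verification(llm: Callable[[str], str], passage: str, question: str) -> Dict:
    # 1. Draft an initial extraction.
    draft = llm(f"EXTRACT\nPassage: {passage}\nQuestion: {question}")
    # 2. Ask for verification questions, one per line.
    plan = llm(f"PLAN\nDraft answer: {draft}\nList verification questions, one per line.")
    checks = [q.strip() for q in plan.splitlines() if q.strip()]
    # 3. Answer each check from the passage only, without showing the draft,
    #    so an error in the draft cannot leak into the verification.
    answers = {q: llm(f"CHECK\nPassage: {passage}\nQuestion: {q}") for q in checks}
    # 4. Revise the draft in light of the verification results.
    evidence = "\n".join(f"{q} -> {a}" for q, a in answers.items())
    final = llm(f"REVISE\nDraft answer: {draft}\nVerification results:\n{evidence}")
    return {"draft": draft, "checks": answers, "final": final}


def stub_llm(prompt: str) -> str:
    """Deterministic stand-in for a real model; it deliberately drafts a wrong answer."""
    if prompt.startswith("EXTRACT"):
        return "stable"
    if prompt.startswith("PLAN"):
        return "Does the passage say the compound dissolves in water?"
    if prompt.startswith("CHECK"):
        return "yes" if "soluble in water" in prompt else "no"
    if prompt.startswith("REVISE"):
        return "unstable" if "-> yes" in prompt else "stable"
    return ""


result = chain_of_verification(
    stub_llm, "Compound 1 is soluble in water.", "Is the material water-stable?"
)
print(result["draft"], "->", result["final"])
```

In this toy run the stub drafts "stable", the independent check finds the solubility statement, and the revision step corrects the label.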

Graph‑Enhanced Retrieval‑Augmented Generation (RAG)

The system retrieves relevant graph information and supplies it to the LLM as few‑shot context for question answering. The Query and Neighbor‑Finder components are independent modules that AI agents can invoke as needed.
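The retrieval step can be sketched with a list of triples, a neighbor lookup, and a prompt builder. The function names, relation labels, and example facts below are assumptions made for illustration and do not reflect the system's actual module interfaces.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def find_neighbors(triples: List[Triple], entity: str, max_hops: int = 1) -> List[Triple]:
    """Collect facts reachable from `entity` within `max_hops` outgoing hops."""
    frontier, seen, facts = {entity}, {entity}, []
    for _ in range(max_hops):
        reached = set()
        for subj, rel, obj in triples:
            if subj in frontier:
                facts.append((subj, rel, obj))
                if obj not in seen:
                    reached.add(obj)
        seen |= reached
        frontier = reached
    return facts


def build_rag_prompt(question: str, facts: List[Triple]) -> str:
    """Place the retrieved facts ahead of the question as grounding context."""
    context = "\n".join(f"{s} | {r} | {o}" for s, r, o in facts)
    return f"Use only these facts and cite them:\n{context}\n\nQuestion: {question}"


# Hypothetical facts for the example.
triples = [
    ("Example-MOF", "HAS_PROPERTY", "water stability: unstable"),
    ("Example-MOF", "REPORTED_IN", "Paper-A"),
    ("Paper-A", "ALSO_REPORTS", "Other-MOF"),
    ("Unrelated-MOF", "REPORTED_IN", "Paper-B"),
]
facts = find_neighbors(triples, "Example-MOF", max_hops=1)
rag_prompt = build_rag_prompt("Is Example-MOF water-stable?", facts)
print(rag_prompt)
```

Keeping lookup and prompt assembly as separate functions mirrors the modular design described above: an agent can call the neighbor lookup alone when it only needs facts.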

MOF Recommendation and Embedding Space

Chemical and geometric descriptors, such as revised autocorrelations (RACs), pore volume, and pore size, embed MOFs into a low‑dimensional space. Nearest‑neighbor search in that space recommends similar materials for gas adsorption or carbon capture, translating expert intuition into machine‑learnable rules.
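The recommendation idea can be sketched as a nearest-neighbor search over standardized descriptor vectors. This is a simplified illustration: it uses two made-up descriptors per material, skips any dimensionality reduction, and the material names and values are invented for the example.

```python
import math
from typing import List


def standardize(vectors: List[List[float]]) -> List[List[float]]:
    """Scale each descriptor to zero mean and unit variance so that no single
    descriptor dominates the distance because of its units."""
    n, dims = len(vectors), len(vectors[0])
    means = [sum(v[d] for v in vectors) / n for d in range(dims)]
    stds = [
        math.sqrt(sum((v[d] - means[d]) ** 2 for v in vectors) / n) or 1.0
        for d in range(dims)
    ]
    return [[(v[d] - means[d]) / stds[d] for d in range(dims)] for v in vectors]


def nearest_neighbors(names: List[str], vectors: List[List[float]],
                      query_name: str, k: int = 2) -> List[str]:
    """Return the k materials closest to `query_name` by Euclidean distance."""
    scaled = dict(zip(names, standardize(vectors)))
    query = scaled[query_name]
    ranked = sorted(
        (math.dist(query, vec), name)
        for name, vec in scaled.items() if name != query_name
    )
    return [name for _, name in ranked[:k]]


# Invented descriptors: [pore volume, pore size]; not real measurements.
names = ["seed", "near", "mid", "far"]
vectors = [[1.0, 10.0], [1.1, 10.5], [1.5, 14.0], [3.0, 25.0]]
recommended = nearest_neighbors(names, vectors, "seed", k=2)
print(recommended)
```

Starting from an expert-recommended material as the query, the neighbors returned play the role of the model-recommended candidates.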

Results Demonstration

Water‑Stability Prediction

A classifier trained on the MOF‑ChemUnity water‑stability dataset achieves 80% accuracy and an F1 score of 86%. The graph also contains CO₂ adsorption data, enabling joint screening for materials that satisfy both criteria.
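An F1 score above the accuracy is what one would expect when the positive class dominates and is predicted well. The sketch below computes both metrics from confusion-matrix counts. The counts are invented to illustrate the relationship and are not the paper's confusion matrix.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; true negatives do not enter."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for 100 test materials (assumed, not reported data).
tp, fp, fn, tn = 60, 15, 5, 20

acc = accuracy(tp, fp, fn, tn)
f1 = f1_score(tp, fp, fn)
print(f"accuracy={acc:.2f}, F1={f1:.2f}")
```

Because F1 ignores true negatives, a classifier that rarely misses the majority positive class can score higher on F1 than on accuracy.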

Expert Recommendation Validation

Experts typically rely on intuition to recommend MOFs. By linking expert recommendations to crystal structures, the system embeds these insights into a structure‑aware chemical space. In methane storage and CO₂ capture tasks, nearest‑neighbor MOFs (model‑recommended) exhibit performance comparable to expert‑chosen materials, demonstrating that the model can learn and generalize expert intuition when mapped to structural space.

For methane storage, expert‑recommended and model‑recommended MOFs show significantly higher average CH₄ uptake than random samples, confirming the value of geometric attributes. In CO₂ capture, expert recommendations perform similarly to random samples, indicating lower reliability of intuition in this domain.

Literature AI Assistant

When a standard LLM (GPT‑4o) is asked about the lithium‑based MOF ULMOF‑5 (referred to in its paper as “Compound 1”), it hallucinates and confuses the material with the Zn‑based MOF‑5. MOF‑ChemUnity correctly associates “Compound 1” with its CSD entry, extracts the sentence “compound 1 is soluble in water,” labels the material as water‑unstable, and provides citations.

In a blind evaluation across fact retrieval, structure‑property inference, and material recommendation tasks, the graph‑enhanced assistant consistently outperformed the baseline LLM, receiving higher expert scores for citation support, concrete examples, and verifiable claims.

Extensibility to Other Material Classes

The framework generalizes to other porous materials (covalent organic frameworks, zeolites, polymers) that suffer from heterogeneous naming and data formats. By applying the same entity resolution, relationship modeling, and attribute extraction pipeline, disparate datasets can be unified and made FAIR‑compliant.

Emerging standards such as the IUPAC Adsorption Information File (AIF) can be seamlessly incorporated, ensuring continuous dataset expansion and supporting high‑throughput, multi‑objective material screening.

Future Potential

MOF‑ChemUnity enables natural‑language queries like “Which water‑stable MOFs with high pollutant‑removal capacity contain metal X?” The system returns verifiable answers grounded in literature, experiments, and simulations, establishing a new benchmark for AI‑assisted materials research.