97% F1 Score: MOFSeq‑LMM Uses LLMs to Efficiently Predict MOF Synthesizability
A joint Princeton and Colorado School of Mines team introduced MOFSeq‑LMM, a large‑language‑model‑based framework that pairs a million‑scale MOF dataset with a novel string representation to predict free energy with an MAE of 0.789 kJ/mol and synthesizability with a 97% F1 score, dramatically accelerating high‑throughput MOF screening.
Researchers from Princeton University and the Colorado School of Mines present MOFSeq‑LMM, a machine‑learning pipeline that predicts the free energy of metal‑organic frameworks (MOFs) directly from their structural sequences, enabling rapid, high‑throughput thermodynamic assessment.
MOFMinE Dataset
The team constructed MOFMinE, comprising roughly one million MOF prototypes generated via the ToBaCCo‑3.0 platform. The dataset encodes full design‑to‑synthesis information, covering 1,393 topology templates, 27 inorganic node building blocks (NBBs), 14 organic NBBs, 19 basic edge building blocks (EBBs), and 13 functional modifications. Structural properties span void fractions of 0.01–0.99, gravimetric surface areas of 26–8,382 m²/g, and pore diameters of 2.6–127.7 Å.
A curated subset of 65,574 MOFs includes free‑energy values and serves as the fine‑tuning and test set for the model.
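For concreteness, the sketch below shows what a MOFMinE‑style record and the curated free‑energy split could look like in Python. The schema, field names, and split function are illustrative assumptions, not the dataset's actual format.

```python
# Hypothetical MOFMinE-style record; every field name here is an assumption,
# chosen only to mirror the design-to-synthesis attributes described above.
import random

record = {
    "mof_id": "tbo_Cu_paddlewheel_BDC_NH2",   # hypothetical identifier
    "topology": "tbo",                         # one of ~1,393 templates
    "inorganic_nbb": "Cu_paddlewheel",         # 27 inorganic NBBs
    "organic_nbb": None,                       # 14 organic NBBs
    "ebb": "BDC",                              # 19 basic EBBs
    "functionalization": "NH2",                # 13 modifications
    "void_fraction": 0.62,                     # dataset range: 0.01-0.99
    "surface_area_m2_g": 3150.0,               # dataset range: 26-8,382 m²/g
    "pore_diameter_A": 11.4,                   # dataset range: 2.6-127.7 Å
    "free_energy_kj_mol_atom": None,           # labeled for 65,574 MOFs
}

def split_curated(records, test_frac=0.1, seed=0):
    """Split the free-energy-labeled subset into fine-tuning and test sets."""
    labeled = [r for r in records if r["free_energy_kj_mol_atom"] is not None]
    rng = random.Random(seed)
    rng.shuffle(labeled)
    n_test = int(len(labeled) * test_frac)
    return labeled[n_test:], labeled[:n_test]
```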
MOFSeq Representation
To overcome limitations of existing descriptors, the authors devised MOFSeq, a compact string‑based representation that encodes both local (atomic composition and intra‑unit connectivity) and global (unit‑level topology and inter‑unit connections) structural features. MOFid tools provide the local identifiers, while ToBaCCo‑3.0 supplies the global topology information.
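The exact MOFSeq grammar is not reproduced here, but the core idea of fusing local identifiers with global topology tokens can be sketched as follows; the separators, helper function, and example building blocks are all hypothetical.

```python
# A minimal sketch of a MOFSeq-like string: MOFid-style SMILES carry the local
# information (atomic composition, intra-unit connectivity), while ToBaCCo-style
# tokens carry the global topology and inter-unit links. Separators are invented.
def build_mofseq(local_smiles, topology, connections):
    local_part = ".".join(local_smiles)                    # local building blocks
    global_part = f"&&{topology}" + "".join(
        f"|{a}-{b}" for a, b in connections                # unit-level links
    )
    return local_part + global_part

# Illustrative example: a Cu paddlewheel node plus a BDC linker on the 'tbo' net
seq = build_mofseq(
    local_smiles=["[Cu][Cu]", "O=C(O)c1ccc(C(=O)O)cc1"],
    topology="tbo",
    connections=[(0, 1)],
)
print(seq)  # [Cu][Cu].O=C(O)c1ccc(C(=O)O)cc1&&tbo|0-1
```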
LLM‑Prop Model Design
Built on MOFSeq inputs, LLM‑Prop is a lightweight large language model (≈35 M parameters) with a 2,000‑token context window. Its attention mechanism captures interactions between local and global features, balancing learning capacity against computational cost.
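A minimal PyTorch sketch of what an LLM‑Prop‑style regressor could look like. Only the ≈35 M parameter scale, the 2,000‑token context window, and the 0.2 dropout rate come from the article; the vocabulary size, depth, and width below are illustrative guesses.

```python
import torch
import torch.nn as nn

class MOFSeqRegressor(nn.Module):
    """Small transformer encoder over MOFSeq tokens with a regression head."""
    def __init__(self, vocab=1024, d_model=512, n_heads=8, n_layers=8, max_len=2000):
        super().__init__()
        self.tok = nn.Embedding(vocab, d_model)
        self.pos = nn.Embedding(max_len, d_model)          # 2,000-token context
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            dropout=0.2, batch_first=True,                 # dropout 0.2 per the article
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)                  # scalar energy output

    def forward(self, token_ids):
        pos = torch.arange(token_ids.size(1), device=token_ids.device)
        h = self.encoder(self.tok(token_ids) + self.pos(pos))
        return self.head(h.mean(dim=1)).squeeze(-1)        # mean-pool, then regress
```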
Pre‑training
The model is first pre‑trained to predict strain energy, a low‑cost proxy that correlates strongly with free energy, on 634,463 training samples. A dropout rate of 0.2 yields the best pre‑training performance (MAE 0.623 kJ/mol, R² 0.965).
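Under the assumptions of the sketch above, the pre‑training phase reduces to regressing the strain‑energy proxy with an L1 (MAE) objective; the data loader and optimizer settings are placeholders.

```python
import torch

model = MOFSeqRegressor()                             # from the sketch above
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)  # learning rate is a guess
loss_fn = torch.nn.L1Loss()                           # MAE, matching the reported metric

def pretrain_epoch(loader):
    """One pass over (token_ids, strain_energy) batches of the proxy task."""
    model.train()
    for token_ids, strain_energy in loader:
        opt.zero_grad()
        loss = loss_fn(model(token_ids), strain_energy)
        loss.backward()
        opt.step()
```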
Fine‑tuning
Fine‑tuning swaps the target to free‑energy prediction and extends training to 200 epochs. Despite its modest size (roughly 1/2,000 of Llama 2's 70 B parameters), the model achieves an MAE of 0.789 kJ/mol and R² of 0.990 on unseen MOFs.
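Continuing the same sketch, fine‑tuning reuses the pre‑trained weights and simply swaps the regression target; only the 200‑epoch figure comes from the article, and the loader stands in for the 65,574‑MOF curated subset.

```python
def finetune(loader, epochs=200):
    """Same model and MAE objective; the target is now free energy."""
    model.train()
    for _ in range(epochs):
        for token_ids, free_energy in loader:
            opt.zero_grad()
            loss = loss_fn(model(token_ids), free_energy)
            loss.backward()
            opt.step()
```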
Performance Evaluation
Free‑Energy Prediction
On a held‑out test set, LLM‑Prop attains a mean absolute error of 0.789 kJ/mol per MOF atom and R² of 0.990, indicating near‑perfect agreement with the reference calculations.
Ablation Studies
Local features only (with pre‑training): MAE 1.168 kJ/mol, R² 0.974.
Global features only: MAE below 1.0 kJ/mol, R² ≈ 0.980; pre‑training adds only marginal gains.
Local + global features combined (with pre‑training): best performance, MAE 0.789 kJ/mol, R² 0.990.
Synthesizability Classification
Using a free‑energy threshold of 4.4 kJ/mol per MOF atom, the model classifies MOFs as synthesizable or not, achieving an F1 score of 97% and an AUC of 0.98, with only a small fraction of candidates misclassified.
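The decision rule can be sketched as a simple threshold on predicted free energy relative to a stability baseline (which baseline the paper uses is not specified here); the toy numbers and scoring below are purely illustrative.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

THRESHOLD = 4.4  # kJ/mol per MOF atom, per the article

def classify(delta_f):
    """Label a MOF synthesizable (1) if its relative free energy is low enough."""
    return (np.asarray(delta_f) <= THRESHOLD).astype(int)

# Toy evaluation against made-up ground-truth labels
delta_f = np.array([1.2, 6.8, 3.9, 0.4, 5.5])
labels = np.array([1, 0, 1, 1, 0])
print(f1_score(labels, classify(delta_f)))     # F1 on the thresholded labels
print(roc_auc_score(labels, -delta_f))         # lower energy = higher score
```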
Polymorph Selection
Across 7,490 polymorph families (2–50 structures each), the model correctly identifies the most stable polymorph in 63% of cases when free‑energy differences are at most 0.16 kJ/mol per MOF atom, rising to 89% when differences reach 0.49 kJ/mol per MOF atom. The overall average success rate is about 78%.
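Polymorph selection reduces to an argmin over predicted free energies within each family. A minimal sketch, assuming families are stored as id → candidate lists; the data structures are placeholders.

```python
def most_stable(candidates):
    """candidates: list of (mof_id, predicted_free_energy) pairs."""
    return min(candidates, key=lambda pair: pair[1])[0]

def selection_rate(families, reference_best):
    """Fraction of families where the predicted minimum matches the reference."""
    hits = sum(most_stable(cands) == reference_best[fid]
               for fid, cands in families.items())
    return hits / len(families)
```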
Implications
The study demonstrates that a compact LLM trained on a massive, well‑characterized MOF dataset can deliver high‑accuracy thermodynamic predictions and reliable synthesizability assessments, paving the way for AI‑driven high‑throughput materials discovery.
Reference: "Highly Accurate and Fast Prediction of MOF Free Energy via Machine Learning," Journal of the American Chemical Society (DOI: 10.1021/jacs.5c13960).