97% F1 Score: MOFSeq‑LMM Uses LLMs to Efficiently Predict MOF Synthesizability
A joint Princeton and Colorado School of Mines team introduced MOFSeq‑LMM, a large‑language‑model‑based framework that pairs a million‑scale MOF dataset with a novel string representation to predict free energy with an MAE of 0.789 kJ/mol and synthesizability with a 97% F1 score, dramatically accelerating high‑throughput MOF screening.
Researchers from Princeton University and the Colorado School of Mines present MOFSeq‑LMM, a machine‑learning pipeline that predicts the free energy of metal‑organic frameworks (MOFs) directly from their structural sequences, enabling rapid, high‑throughput thermodynamic assessment.
MOFMinE Dataset
The team constructed MOFMinE, comprising roughly one million MOF prototypes generated via the ToBaCCo‑3.0 platform. The dataset encodes full design‑to‑synthesis information, covering 1,393 topology templates, 27 inorganic node building blocks (NBBs), 14 organic NBBs, 19 basic edge building blocks (EBBs), and 13 functional modifications. Structural properties span void fractions of 0.01–0.99, gravimetric surface areas of 26–8,382 m²/g, and pore diameters of 2.6–127.7 Å.
A curated subset of 65,574 MOFs includes free‑energy values and serves as the fine‑tuning and test set for the model.
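For concreteness, the sketch below shows what a MOFMinE‑style record and the curated free‑energy split could look like in Python. The schema, field names, and split function are illustrative assumptions, not the dataset's actual format.

```python
# Hypothetical MOFMinE-style record; every field name here is an assumption,
# chosen only to mirror the design-to-synthesis attributes described above.
import random

record = {
    "mof_id": "tbo_Cu_paddlewheel_BDC_NH2",   # hypothetical identifier
    "topology": "tbo",                         # one of ~1,393 templates
    "inorganic_nbb": "Cu_paddlewheel",         # 27 inorganic NBBs
    "organic_nbb": None,                       # 14 organic NBBs
    "ebb": "BDC",                              # 19 basic EBBs
    "functionalization": "NH2",                # 13 modifications
    "void_fraction": 0.62,                     # dataset range: 0.01-0.99
    "surface_area_m2_g": 3150.0,               # dataset range: 26-8,382 m²/g
    "pore_diameter_A": 11.4,                   # dataset range: 2.6-127.7 Å
    "free_energy_kj_mol_atom": None,           # labeled for 65,574 MOFs
}

def split_curated(records, test_frac=0.1, seed=0):
    """Split the free-energy-labeled subset into fine-tuning and test sets."""
    labeled = [r for r in records if r["free_energy_kj_mol_atom"] is not None]
    rng = random.Random(seed)
    rng.shuffle(labeled)
    n_test = int(len(labeled) * test_frac)
    return labeled[n_test:], labeled[:n_test]
```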
MOFSeq Representation
To overcome limitations of existing descriptors, the authors devised MOFSeq, a compact string‑based representation that encodes both local (atomic composition and intra‑unit connectivity) and global (unit‑level topology and inter‑unit connections) structural features. MOFid tools provide the local identifiers, while ToBaCCo‑3.0 supplies the global topology information.
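The exact MOFSeq grammar is not reproduced here, but the core idea of fusing local identifiers with global topology tokens can be sketched as follows; the separators, helper function, and example building blocks are all hypothetical.

```python
# A minimal sketch of a MOFSeq-like string: MOFid-style SMILES carry the local
# information (atomic composition, intra-unit connectivity), while ToBaCCo-style
# tokens carry the global topology and inter-unit links. Separators are invented.
def build_mofseq(local_smiles, topology, connections):
    local_part = ".".join(local_smiles)                    # local building blocks
    global_part = f"&&{topology}" + "".join(
        f"|{a}-{b}" for a, b in connections                # unit-level links
    )
    return local_part + global_part

# Illustrative example: a Cu paddlewheel node plus a BDC linker on the 'tbo' net
seq = build_mofseq(
    local_smiles=["[Cu][Cu]", "O=C(O)c1ccc(C(=O)O)cc1"],
    topology="tbo",
    connections=[(0, 1)],
)
print(seq)  # [Cu][Cu].O=C(O)c1ccc(C(=O)O)cc1&&tbo|0-1
```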
LLM‑Prop Model Design
Built on MOFSeq inputs, LLM‑Prop is a lightweight large language model (≈35 M parameters) with a 2,000‑token context window. Its attention mechanism captures interactions between local and global features, balancing learning capacity against computational cost.
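A minimal PyTorch sketch of what an LLM‑Prop‑style regressor could look like. Only the ≈35 M parameter scale, the 2,000‑token context window, and the 0.2 dropout rate come from the article; the vocabulary size, depth, and width below are illustrative guesses.

```python
import torch
import torch.nn as nn

class MOFSeqRegressor(nn.Module):
    """Small transformer encoder over MOFSeq tokens with a regression head."""
    def __init__(self, vocab=1024, d_model=512, n_heads=8, n_layers=8, max_len=2000):
        super().__init__()
        self.tok = nn.Embedding(vocab, d_model)
        self.pos = nn.Embedding(max_len, d_model)          # 2,000-token context
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            dropout=0.2, batch_first=True,                 # dropout 0.2 per the article
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)                  # scalar energy output

    def forward(self, token_ids):
        pos = torch.arange(token_ids.size(1), device=token_ids.device)
        h = self.encoder(self.tok(token_ids) + self.pos(pos))
        return self.head(h.mean(dim=1)).squeeze(-1)        # mean-pool, then regress
```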
Pre‑training
The model is first pre‑trained to predict strain energy, a low‑cost proxy that correlates strongly with free energy, on 634,463 training samples. A dropout rate of 0.2 yields the best pre‑training performance (MAE 0.623 kJ/mol, R² 0.965).
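Under the assumptions of the sketch above, the pre‑training phase reduces to regressing the strain‑energy proxy with an L1 (MAE) objective; the data loader and optimizer settings are placeholders.

```python
import torch

model = MOFSeqRegressor()                             # from the sketch above
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)  # learning rate is a guess
loss_fn = torch.nn.L1Loss()                           # MAE, matching the reported metric

def pretrain_epoch(loader):
    """One pass over (token_ids, strain_energy) batches of the proxy task."""
    model.train()
    for token_ids, strain_energy in loader:
        opt.zero_grad()
        loss = loss_fn(model(token_ids), strain_energy)
        loss.backward()
        opt.step()
```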
Fine‑tuning
Fine‑tuning swaps the target to free‑energy prediction and extends training to 200 epochs. Despite its modest size (roughly 1/2,000 of Llama 2's 70 B parameters), the model achieves an MAE of 0.789 kJ/mol and R² of 0.990 on unseen MOFs.
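Continuing the same sketch, fine‑tuning reuses the pre‑trained weights and simply swaps the regression target; only the 200‑epoch figure comes from the article, and the loader stands in for the 65,574‑MOF curated subset.

```python
def finetune(loader, epochs=200):
    """Same model and MAE objective; the target is now free energy."""
    model.train()
    for _ in range(epochs):
        for token_ids, free_energy in loader:
            opt.zero_grad()
            loss = loss_fn(model(token_ids), free_energy)
            loss.backward()
            opt.step()
```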
Performance Evaluation
Free‑Energy Prediction
On a held‑out test set, LLM‑Prop attains a mean absolute error of 0.789 kJ/mol per MOF atom and R² of 0.990, indicating near‑perfect agreement with the reference calculations.
Ablation Studies
Local features only (with pre‑training): MAE 1.168 kJ/mol, R² 0.974.
Global features only: MAE below 1.0 kJ/mol, R² ≈ 0.980; pre‑training adds only marginal gains.
Local + global features combined (with pre‑training): best performance, MAE 0.789 kJ/mol, R² 0.990.
Synthesizability Classification
Using a free‑energy threshold of 4.4 kJ/mol per MOF atom, the model classifies MOFs as synthesizable or not, achieving an F1 score of 97% and an AUC of 0.98, with only a small fraction of candidates misclassified.
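The decision rule can be sketched as a simple threshold on predicted free energy relative to a stability baseline (which baseline the paper uses is not specified here); the toy numbers and scoring below are purely illustrative.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

THRESHOLD = 4.4  # kJ/mol per MOF atom, per the article

def classify(delta_f):
    """Label a MOF synthesizable (1) if its relative free energy is low enough."""
    return (np.asarray(delta_f) <= THRESHOLD).astype(int)

# Toy evaluation against made-up ground-truth labels
delta_f = np.array([1.2, 6.8, 3.9, 0.4, 5.5])
labels = np.array([1, 0, 1, 1, 0])
print(f1_score(labels, classify(delta_f)))     # F1 on the thresholded labels
print(roc_auc_score(labels, -delta_f))         # lower energy = higher score
```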
Polymorph Selection
Across 7,490 polymorph families (2–50 structures each), the model correctly identifies the most stable polymorph in 63% of cases when free‑energy differences are at most 0.16 kJ/mol per MOF atom, rising to 89% when differences reach 0.49 kJ/mol per MOF atom. The overall average success rate is about 78%.
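Polymorph selection reduces to an argmin over predicted free energies within each family. A minimal sketch, assuming families are stored as id → candidate lists; the data structures are placeholders.

```python
def most_stable(candidates):
    """candidates: list of (mof_id, predicted_free_energy) pairs."""
    return min(candidates, key=lambda pair: pair[1])[0]

def selection_rate(families, reference_best):
    """Fraction of families where the predicted minimum matches the reference."""
    hits = sum(most_stable(cands) == reference_best[fid]
               for fid, cands in families.items())
    return hits / len(families)
```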
Implications
The study demonstrates that a compact LLM trained on a massive, well‑characterized MOF dataset can deliver high‑accuracy thermodynamic predictions and reliable synthesizability assessments, paving the way for AI‑driven high‑throughput materials discovery.
Reference: "Highly Accurate and Fast Prediction of MOF Free Energy via Machine Learning," Journal of the American Chemical Society (DOI: 10.1021/jacs.5c13960).