Training an 11.5 B‑parameter Universal Interatomic Potential in Hours on Exascale Supercomputers

A Chinese Academy of Sciences team introduced the MatRIS‑MoE model and the Janus training framework. Together they allow an 11.5‑billion‑parameter universal machine‑learning interatomic potential to be trained on two exascale systems at 1.2 EFLOPS, cutting training time from weeks to a few hours.

Data Party THU

Problem and Constraints

Universal machine‑learning interatomic potentials (uMLIPs) predict total energy and obtain forces and stresses from it by automatic differentiation. Because the training loss includes those forces and stresses, its gradient must differentiate through a derivative, which introduces second‑order derivatives. Molecular‑dynamics stability demands full‑precision (FP32) arithmetic, and the atom graphs can contain billions of nodes. The combination of second‑order training, FP32 precision, and massive graphs has prevented billion‑parameter universal potentials from being trained in practice.
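The second‑order requirement can be written out explicitly. The derivation below is a sketch that assumes a plain squared‑error force loss with weight \lambda_F; the source does not give the exact loss, and the symbols E_\theta (predicted energy), \mathbf{r}_i (position of atom i), and \mathbf{F}_i^{\mathrm{ref}} (reference force) are notation introduced here.

```latex
\begin{aligned}
\mathbf{F}_i &= -\frac{\partial E_\theta}{\partial \mathbf{r}_i}, \\
\mathcal{L}_F &= \lambda_F \sum_i \left\lVert \mathbf{F}_i - \mathbf{F}_i^{\mathrm{ref}} \right\rVert^2, \\
\frac{\partial \mathcal{L}_F}{\partial \theta}
  &= -2\lambda_F \sum_i \left(\mathbf{F}_i - \mathbf{F}_i^{\mathrm{ref}}\right)^{\top}
     \frac{\partial^2 E_\theta}{\partial \theta\,\partial \mathbf{r}_i}.
\end{aligned}
```

The mixed derivative in the last line is what forces a second backward pass through the network: the first pass produces the forces, and the second differentiates them with respect to the parameters.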

MatRIS‑MoE Architecture

MatRIS‑MoE extends the invariant MatRIS graph‑based architecture. Atoms, pairwise distances, and three‑body angles are embedded, and additional task, charge/spin, and global‑feature embeddings align heterogeneous datasets, DFT functionals, and system types into a unified representation space. Unlike the single‑task MatRIS, MatRIS‑MoE targets multi‑domain universal modeling.

Mixture‑of‑Experts Design

Two MoE layers are placed before and after the attention mechanism to handle message construction and feature update, respectively. Routing is based on element type (Top‑K expert activation per element) rather than on instantaneous coordinates. This lets experts specialize in specific elements and chemical environments, which improves cross‑domain expressiveness, and it keeps expert activation stable as atoms move, so the potential‑energy surface stays smooth.
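A minimal sketch of element‑based Top‑K routing follows. The logits table, the expert count, and the function name `route` are hypothetical placeholders, not values or APIs from the paper. The point it illustrates is that the selected experts depend only on the atomic number, so moving an atom cannot change which experts fire.

```python
import math

TOP_K = 2

# Hypothetical learned router table: one row of expert logits per atomic number.
ROUTER_LOGITS = {
    1: [2.0, 0.5, -1.0, 0.0],   # H
    6: [0.1, 1.5, 1.2, -0.3],   # C
    8: [-0.5, 0.2, 0.9, 2.1],   # O
}

def route(atomic_number, k=TOP_K):
    """Return [(expert_index, gate_weight)] for one element.

    Coordinates are never consulted, so the choice is constant along a trajectory.
    """
    logits = ROUTER_LOGITS[atomic_number]
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    # Softmax restricted to the selected experts (a common gating choice, assumed here).
    peak = max(logits[e] for e in top)
    exps = [math.exp(logits[e] - peak) for e in top]
    norm = sum(exps)
    return [(e, w / norm) for e, w in zip(top, exps)]

# Toy water molecule: (atomic number, position). Positions are ignored by the router.
water = [(8, (0.0, 0.0, 0.0)), (1, (0.96, 0.0, 0.0)), (1, (-0.24, 0.93, 0.0))]
for z, _xyz in water:
    print(z, route(z))
```

Both hydrogen atoms receive identical expert assignments regardless of where they sit, which is the stability property the text describes.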

Conservative Training Objective

The model first predicts total energy; forces and stresses are obtained by automatic differentiation, preserving physical consistency. A multi‑task robust loss aggregates, for each task, the batch‑wise mean and variance of the loss and applies smooth down‑weighting to outliers, reducing interference among heterogeneous tasks.
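The source does not give the formula for the robust loss, so the following is only an illustrative sketch of the idea: compute the batch mean and spread per task, then smoothly shrink the weight of samples whose loss is far above the mean. The weighting function `1 / (1 + z**2)` and the names `robust_task_loss` and `multi_task_loss` are assumptions made for this example.

```python
import statistics

def robust_task_loss(per_sample_losses, eps=1e-8):
    """Weighted mean of one task's losses with smooth down-weighting of outliers.

    Assumed scheme (not from the paper): z-score each loss against the batch
    mean and standard deviation, then weight by 1 / (1 + z^2) for z > 0.
    In real training the weights would be treated as constants (no gradient).
    """
    mean = statistics.fmean(per_sample_losses)
    std = statistics.pstdev(per_sample_losses)
    weights = []
    for loss in per_sample_losses:
        z = max(0.0, (loss - mean) / (std + eps))  # only unusually large losses are damped
        weights.append(1.0 / (1.0 + z * z))
    weighted = sum(w * l for w, l in zip(weights, per_sample_losses))
    return weighted / sum(weights)

def multi_task_loss(batch):
    """Sum the robust loss over tasks; `batch` maps task name -> list of sample losses."""
    return sum(robust_task_loss(losses) for losses in batch.values())

batch = {
    "crystals": [1.0, 1.0, 1.0, 100.0],  # one outlier dominates the plain mean
    "molecules": [2.0, 2.0, 2.0],
}
print(multi_task_loss(batch))
```

Because statistics are computed per task, an outlier in one dataset changes only that task's weights, which is one way to limit interference among heterogeneous tasks.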

Janus Distributed Training Framework

Janus introduces the FS‑3D execution unit that unifies three parallelism strategies:

FSDP (Fully Sharded Data Parallel) shards model parameters, gradients, and optimizer states to lower static memory usage.

FSGP (Graph Sharding) partitions a large atom graph across multiple devices, reducing activation memory.

FSEP (MoE Expert Sharding, based on LAER‑MoE) distributes expert parameters and activates only the needed experts per step.
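A rough back‑of‑the‑envelope calculation shows why sharding static state matters at this scale. It assumes an Adam‑style optimizer (two FP32 moments per parameter), which the source does not state, and it ignores activations and communication buffers. The function name and the device count of 1024 are hypothetical.

```python
def static_bytes_per_device(num_params, num_devices,
                            bytes_per_value=4, values_per_param=4):
    """Static memory per device when all state is evenly sharded.

    values_per_param=4 assumes FP32 weight + gradient + two Adam moments
    (an assumption; the paper's optimizer is not given in this summary).
    """
    return num_params * bytes_per_value * values_per_param / num_devices

total_params = 11.5e9  # largest model size quoted in the article

unsharded_gb = static_bytes_per_device(total_params, 1) / 1e9
sharded_gb = static_bytes_per_device(total_params, 1024) / 1e9

print(f"unsharded: {unsharded_gb:.1f} GB, sharded over 1024 devices: {sharded_gb:.3f} GB")
```

Under these assumptions the unsharded static state is about 184 GB, more than a single accelerator typically holds, while full sharding brings the per‑device share down to a fraction of a gigabyte.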

To avoid activating many unused experts, Janus performs a JIT planning phase before each training step: it batch‑routes all MoE layers, tallies token loads per expert, and plans local and global expert placement, activating only the required experts and balancing load across ranks.
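The planning step can be sketched as a load tally followed by a placement pass. The greedy "heaviest expert to least‑loaded rank" rule below is a stand‑in chosen for brevity, not Janus's actual planner, and `plan_experts` is a hypothetical name.

```python
from collections import Counter
import heapq

def plan_experts(routed_expert_ids, num_ranks):
    """Plan expert placement for one training step.

    routed_expert_ids: expert index for every routed token, gathered from
    all MoE layers before the step runs. Experts that receive no tokens
    never appear in the plan, so they are not activated.
    """
    loads = Counter(routed_expert_ids)               # tokens per used expert
    heap = [(0, rank) for rank in range(num_ranks)]  # (current load, rank)
    heapq.heapify(heap)
    placement = {}
    # Greedy balancing: place the heaviest experts first on the lightest rank.
    for expert, load in sorted(loads.items(), key=lambda kv: (-kv[1], kv[0])):
        rank_load, rank = heapq.heappop(heap)
        placement[expert] = rank
        heapq.heappush(heap, (rank_load + load, rank))
    rank_loads = [0] * num_ranks
    for expert, rank in placement.items():
        rank_loads[rank] += loads[expert]
    return placement, rank_loads

# Toy step: only experts 0, 3, 5 and 7 (out of, say, 8) receive tokens.
tokens = [0] * 50 + [3] * 30 + [5] * 20 + [7] * 20
placement, rank_loads = plan_experts(tokens, num_ranks=2)
print(placement, rank_loads)
```

Since routing depends on element type rather than coordinates, the full set of required experts is known from the batch composition before any forward computation, which is what makes planning ahead of the step possible.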

uMLIP training involves three phases: forward, first backward (which produces forces and stresses), and second backward (which produces the parameter gradients). Janus records the execution order during the forward pass, reuses it for prefetching and overlapping in the later phases, and delays gradient synchronization until the final backward pass, enabling efficient double‑backward training.
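The record‑then‑reuse idea can be illustrated with a toy scheduler. This sketch makes a simplifying assumption that both backward phases visit layers in reverse forward order; the real double‑backward graph is more involved, and the layer names and functions here are hypothetical.

```python
def prefetch_schedule(order):
    """Pair each layer with the next layer, whose shards can be fetched meanwhile."""
    return [(cur, order[i + 1] if i + 1 < len(order) else None)
            for i, cur in enumerate(order)]

def run_step(layers):
    """Return an event log of (phase, layer, prefetched_layer) for one training step."""
    log = []
    recorded = []
    for name in layers:
        recorded.append(name)              # forward: order unknown in advance, so record it
        log.append(("forward", name, None))
    reverse = recorded[::-1]               # assumed traversal order for both backward phases
    for phase in ("backward_1", "backward_2"):
        for cur, nxt in prefetch_schedule(reverse):
            log.append((phase, cur, nxt))  # compute `cur` while prefetching `nxt`
        if phase == "backward_2":
            # Gradients are synchronized once, after the final backward pass.
            log.append(("sync_gradients", None, None))
    return log

layers = ["embed", "moe_message", "attention", "moe_update", "readout"]
events = run_step(layers)
for event in events:
    print(event)
```

Deferring synchronization means the first backward pass, which only produces forces and stresses, incurs no gradient all‑reduce at all.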

Experimental Results

The system was evaluated on two exascale supercomputers. The training dataset comprised 4.73 × 10⁸ atomic configurations (≈3.6 × 10¹² edges) spanning molecules, crystals, catalytic surfaces, molecular crystals, and MOFs. The largest model contained 11.5 B parameters, with 2.89 B active parameters during MoE execution. Janus achieved over 90 % weak‑scaling efficiency and a peak single‑precision performance of 1.2 EFLOPS. Training time decreased from several weeks to a few hours, marking the first practical use of a billion‑parameter universal interatomic potential on supercomputers.
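A few derived quantities help put these figures in proportion. The first two use only numbers quoted above; the weak‑scaling helper uses the standard definition (baseline time divided by scaled time at fixed work per device), and the timings passed to it are made‑up illustrative values, not measurements from the paper.

```python
total_params = 11.5e9    # largest model
active_params = 2.89e9   # parameters active during MoE execution
configurations = 4.73e8  # atomic configurations in the dataset
edges = 3.6e12           # approximate total graph edges

active_fraction = active_params / total_params
edges_per_config = edges / configurations

def weak_scaling_efficiency(t_base, t_scaled):
    """Efficiency when work per device is fixed and the device count grows."""
    return t_base / t_scaled

print(f"active fraction: {active_fraction:.1%}")
print(f"mean edges per configuration: {edges_per_config:.0f}")
# Hypothetical step times in seconds, for illustration only.
print(f"example efficiency: {weak_scaling_efficiency(100.0, 110.0):.1%}")
```

Roughly a quarter of the parameters are active for any given input, and the average configuration carries on the order of several thousand edges.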

Conclusion

The work demonstrates that universal interatomic potentials can be scaled, trained, and deployed with the same systematic engineering approaches used for large language models, opening a path toward AI‑for‑Science infrastructures.

Reference

Breaking the Training Barrier of Billion‑Parameter Universal Machine Learning Interatomic Potentials, arXiv:2604.15821v1, https://arxiv.org/pdf/2604.15821v1

Source: ScienceAI