Walrus: 1.3B-Parameter Transformer Beats Prior Foundation Models Across 19 Physics Domains
Walrus, a 1.3-billion-parameter Transformer built by Polymathic AI, is pretrained on 19 diverse physics scenarios spanning astrophysics, geoscience, rheology, plasma physics and acoustics. Using techniques such as patch jittering, compute-adaptive tokenisation and space-time factorised attention, it consistently outperforms earlier foundation models on both short- and long-term prediction of continuum dynamics.
Dataset Construction
The pre-training corpus combines 19 heterogeneous scientific datasets (the Well, FlowBench, PDEBench, PDEArena, PDEGym, among others), covering 63 state variables and a wide range of governing equations, boundary conditions and physical parameterisations. Both 2-D and 3-D fields are included so that the model generalises across spatial dimensionalities.
For fine-tuning, each dataset is split 80/10/10 into train/validation/test sets. Pre-training runs for roughly 400 k optimisation steps; each 2-D dataset contributes ~4 M samples and each 3-D dataset ~2 M samples. Training uses the AdamW optimiser with a learning-rate schedule, and performance is measured with the variance-scaled root-mean-square error (VRMSE) across all tasks.
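VRMSE normalises the usual RMSE by the spread of the target field, so errors are comparable across variables with very different magnitudes. Below is a minimal sketch of one common way to compute it per field in PyTorch; the exact normalisation and reduction used in the Walrus evaluation may differ.

```python
import torch

def vrmse(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Variance-scaled RMSE for a single field.

    pred/target: (batch, *spatial) tensors. The squared error is divided by the
    variance of the target so that fields with very different magnitudes
    contribute comparably, then averaged over the batch.
    """
    dims = tuple(range(1, target.dim()))                 # reduce over spatial dims
    mse = ((pred - target) ** 2).mean(dim=dims)
    var = target.var(dim=dims, unbiased=False)
    return torch.sqrt(mse / (var + eps)).mean()
```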
Space‑Time Factorized Transformer Architecture
Spatial processing: Parallelised attention (Wang et al.) combined with axial RoPE positional encoding.
Temporal processing: Causal attention with T5‑style relative position encoding; QK normalisation is applied in both spatial and temporal modules to improve training stability.
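To make the factorisation concrete, the sketch below alternates attention over the spatial token axis with causal attention over the time axis and normalises queries and keys before the dot product. It is a simplified illustration: the axial RoPE and T5-style relative position encodings mentioned above, as well as the MLP and residual-norm details, are omitted, and none of the module names come from the Walrus code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Multi-head self-attention with normalised queries and keys (QK norm)."""
    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # The article does not say which norm is used; LayerNorm is assumed here.
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x: torch.Tensor, causal: bool = False) -> torch.Tensor:
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.reshape(b, n, self.heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        q, k = self.q_norm(q), self.k_norm(k)            # stabilises attention logits
        y = F.scaled_dot_product_attention(q, k, v, is_causal=causal)
        return self.out(y.transpose(1, 2).reshape(b, n, -1))

class SpaceTimeBlock(nn.Module):
    """One factorised block: attention over space, then causal attention over time."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial = QKNormAttention(dim, heads)
        self.temporal = QKNormAttention(dim, heads)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, space_tokens, dim)
        b, t, s, d = x.shape
        x = x + self.spatial(x.reshape(b * t, s, d)).reshape(b, t, s, d)
        xt = x.transpose(1, 2).reshape(b * s, t, d)      # one time series per spatial token
        x = x + self.temporal(xt, causal=True).reshape(b, s, t, d).transpose(1, 2)
        return x
```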
Compute‑adaptive compression: Convolutional Stride Modulation (CSM) in encoder/decoder blocks dynamically adjusts down‑sampling and up‑sampling levels, allowing the model to handle varying resolutions.
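The idea behind compute-adaptive tokenisation can be illustrated with an encoder that chooses how many strided-convolution levels to apply based on the input resolution, keeping the token count roughly bounded. The class and the stride-selection rule below are assumptions for illustration, not the CSM implementation from the paper.

```python
import torch
import torch.nn as nn

class AdaptiveConvEncoder(nn.Module):
    """Illustrative compute-adaptive tokeniser: extra down-sampling levels are
    applied only while the spatial token count exceeds a budget, so
    high-resolution inputs do not explode the sequence length."""
    def __init__(self, in_channels: int, dim: int, max_levels: int = 3):
        super().__init__()
        self.stem = nn.Conv2d(in_channels, dim, kernel_size=3, padding=1)
        # One strided conv per optional down-sampling level.
        self.down = nn.ModuleList(
            [nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1) for _ in range(max_levels)]
        )

    def forward(self, x: torch.Tensor, target_tokens: int = 1024) -> torch.Tensor:
        h = self.stem(x)
        for conv in self.down:
            if h.shape[-2] * h.shape[-1] <= target_tokens:
                break                                    # already cheap enough
            h = conv(h)                                  # halve each spatial side
        return h.flatten(2).transpose(1, 2)              # (batch, tokens, dim)
```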
Shared encoder‑decoder: A single encoder‑decoder pair is shared among all physical systems of the same dimensionality; separate pairs are used for 2‑D and 3‑D data, with lightweight hierarchical MLPs (hMLP) for encoding and decoding.
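One way to realise this sharing is to keep a single 2-D and a single 3-D codec and route each sample by its dimensionality, as in the hypothetical sketch below; the convolutional stems merely stand in for the hMLP encoders and decoders described above.

```python
import torch.nn as nn

class DimensionRoutedCodec(nn.Module):
    """Sketch of one encoder/decoder pair shared by all systems of a given
    dimensionality; module internals are placeholders, not the hMLP from the paper."""
    def __init__(self, channels: int, dim: int):
        super().__init__()
        self.encoders = nn.ModuleDict({
            "2d": nn.Conv2d(channels, dim, kernel_size=4, stride=4),
            "3d": nn.Conv3d(channels, dim, kernel_size=4, stride=4),
        })
        self.decoders = nn.ModuleDict({
            "2d": nn.ConvTranspose2d(dim, channels, kernel_size=4, stride=4),
            "3d": nn.ConvTranspose3d(dim, channels, kernel_size=4, stride=4),
        })

    def encode(self, x):
        key = "2d" if x.dim() == 4 else "3d"             # (B,C,H,W) vs (B,C,D,H,W)
        return self.encoders[key](x), key

    def decode(self, z, key):
        return self.decoders[key](z)
```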
RMSGroupNorm & asymmetric normalisation: RMSGroupNorm stabilises training, while asymmetric normalisation of inputs/outputs preserves numerical stability for incremental predictions.
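A plausible reading of "RMSGroupNorm" is group normalisation without mean subtraction, i.e. dividing each channel group by its root-mean-square. The sketch below follows that reading and is not necessarily the exact Walrus formulation.

```python
import torch
import torch.nn as nn

class RMSGroupNorm(nn.Module):
    """Group normalisation without mean subtraction: each group of channels is
    divided by its root-mean-square, then rescaled per channel."""
    def __init__(self, num_groups: int, num_channels: int, eps: float = 1e-6):
        super().__init__()
        assert num_channels % num_groups == 0
        self.num_groups, self.eps = num_groups, eps
        self.weight = nn.Parameter(torch.ones(num_channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, *spatial)
        b, c = x.shape[:2]
        g = x.reshape(b, self.num_groups, -1)
        rms = g.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        x = (g / rms).reshape_as(x)
        shape = (1, c) + (1,) * (x.dim() - 2)
        return x * self.weight.reshape(shape)
```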
Patch jittering: Random spatial shifts of inputs followed by inverse processing at the output reduce high‑frequency artefacts and markedly improve long‑term prediction stability, especially for ViT‑style backbones.
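The mechanism is simple to sketch: shift the input by a random offset smaller than the patch size, run the model, and undo the shift on the output, so the patch grid falls in a different place at every rollout step. The snippet below assumes periodic boundaries (wrap-around shifts via torch.roll); non-periodic domains would need padding instead.

```python
import torch

def jittered_rollout_step(model, x: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """Apply a random spatial shift before the model and invert it afterwards.

    x: (batch, channels, H, W); model maps this shape to the same shape.
    """
    dy = int(torch.randint(0, patch, (1,)))
    dx = int(torch.randint(0, patch, (1,)))
    x_shifted = torch.roll(x, shifts=(dy, dx), dims=(-2, -1))
    y_shifted = model(x_shifted)
    return torch.roll(y_shifted, shifts=(-dy, -dx), dims=(-2, -1))   # undo the shift
```

Drawing a fresh offset at every step keeps the fixed patch grid of a ViT-style backbone from imprinting itself on long rollouts.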
Efficient multi‑task training: Hierarchical sampling and normalised loss weighting prevent fast‑changing fields from dominating slow‑changing ones; micro‑batching and adaptive tokenisation address load imbalance in high‑dimensional heterogeneous data.
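The normalised loss weighting can be illustrated by dividing each field's error by the variance of its target before averaging, so a rapidly varying field does not swamp a slowly varying one. The exact weighting and hierarchical sampling scheme used by Walrus are not reproduced here; this is a minimal sketch of the principle.

```python
import torch

def normalised_multitask_loss(preds: dict, targets: dict, eps: float = 1e-8) -> torch.Tensor:
    """Average per-field losses after dividing each by the variance of its target.

    preds/targets: field name -> tensor of matching shape.
    """
    per_field = []
    for name, target in targets.items():
        mse = ((preds[name] - target) ** 2).mean()
        per_field.append(mse / (target.var() + eps))     # equalise field scales
    return torch.stack(per_field).mean()
```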
Unified 2‑D/3‑D representation: Zero‑padding a single dimension embeds 2‑D data into a 3‑D space; symmetry‑enhancing augmentations (rotations, reflections) enable cross‑dimensional training.
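Embedding 2-D samples in the 3-D layout is mechanically simple, as the hypothetical helper below shows: add a depth axis and zero-pad it, after which 2-D and 3-D batches flow through the same 3-D pipeline.

```python
import torch
import torch.nn.functional as F

def embed_2d_as_3d(x2d: torch.Tensor, depth: int = 1) -> torch.Tensor:
    """Lift a 2-D field (B, C, H, W) into the 3-D layout (B, C, D, H, W) by adding a
    depth axis and zero-padding it to the requested size; the target depth is an
    illustrative choice, not a value taken from the paper."""
    x3d = x2d.unsqueeze(2)                               # (B, C, 1, H, W)
    if depth > 1:
        # F.pad pads the trailing dims first: (W_l, W_r, H_l, H_r, D_l, D_r)
        x3d = F.pad(x3d, (0, 0, 0, 0, 0, depth - 1))     # zero-pad the depth axis
    return x3d
```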
The architecture therefore processes spatio‑temporal tensors efficiently while supporting diverse, multi‑task training scenarios.
Downstream Performance
When fine‑tuned on a suite of 2‑D and 3‑D benchmark tasks, Walrus outperforms prior foundation models (MPP‑AViT‑L, Poseidon‑L, DPOT‑H). Average VRMSE reductions are ~63.6 % for single‑step prediction, 56.2 % for short‑term trajectory prediction, and 48.3 % for mid‑term trajectory prediction.
Patch jittering yields especially stable long-term forecasts on non-chaotic systems. On stochastic systems such as BubbleML's Pool-BoilSubcool, Walrus's early-time predictions are superior, but the advantage fades over longer horizons because the short input history carries limited information about the stochastic dynamics.
On 3-D tasks whose datasets take millions of core-hours to generate, such as the post-neutron-star-merger (PNS) and red-supergiant-convection (RSG) simulations, Walrus achieves the lowest VRMSE among the compared models.
Cross-domain fine-tuning on each of the 19 pre-training datasets shows that Walrus attains the lowest single-step loss on 18 of the 19 tasks. For rolled-out predictions, its average advantage is 30.5 % over the first 20 steps and 6.3 % over steps 20-60.
Ablation of Pre‑training Strategy
Ablation studies demonstrate that the diverse pre-training strategy is critical. A half-sized variant (HalfWalrus) trained only on the 2-D data still outperforms models trained from scratch or given simpler 2-D-only pre-training on unseen tasks. On 3-D compressible Navier-Stokes (CNS) tasks, HalfWalrus provides modest gains despite never seeing 3-D data, whereas the full Walrus model, trained on both 2-D and 3-D data, shows a pronounced performance boost, underscoring the value of multi-dimensional pre-training.
Reference: Walrus: A Cross‑Domain Foundation Model for Continuum Dynamics (arXiv:2511.15684).