How HRM‑Text, a 1B‑Parameter Model Trained for About $1,500, Reaches Competitive Benchmark Scores
HRM‑Text is a 1‑billion‑parameter model trained in under two days on 16 H100 GPUs at a cost of about $1,500. It combines a hierarchical recursive architecture, an answer‑only loss, and a PrefixLM mask to reach competitive scores on MATH, GSM8K, and ARC‑Challenge, offering an efficient alternative to scaling‑only approaches.
Efficient 1B‑Parameter Model with Strong Benchmarks
HRM‑Text (≈1 B parameters) achieves 56.2 on MATH, 84.5 on GSM8K, and 81.9 on ARC‑Challenge after training on roughly 40 B unique tokens (≈60 B total tokens). Training cost about $1,500 and took less than two days on 16 H100 GPUs. The team released the paper, model weights, and pre‑training code.
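The reported figures imply a few back-of-the-envelope quantities. The sketch below derives them from the numbers above only. Because "less than two days" is an upper bound on wall-clock time, the GPU-hour figure is an upper bound and the implied hourly rate is a lower bound. The variable names are illustrative.

```python
# Back-of-the-envelope figures derived from the numbers reported above.
GPUS = 16
MAX_HOURS = 48            # "less than two days", so this is an upper bound
COST_USD = 1_500          # "about $1,500"
UNIQUE_TOKENS = 40e9      # roughly 40 B unique tokens
TOTAL_TOKENS = 60e9       # roughly 60 B tokens seen in total
PARAMS = 1e9              # roughly 1 B parameters

gpu_hours = GPUS * MAX_HOURS                     # upper bound on GPU-hours
usd_per_gpu_hour = COST_USD / gpu_hours          # implied lower bound on the hourly rate
passes_over_data = TOTAL_TOKENS / UNIQUE_TOKENS  # average number of passes over the unique data
tokens_per_param = TOTAL_TOKENS / PARAMS         # training tokens per parameter

print(gpu_hours, round(usd_per_gpu_hour, 2), passes_over_data, tokens_per_param)
```

On these assumptions the run used at most 768 GPU-hours, each unique token was seen about 1.5 times on average, and the model saw about 60 training tokens per parameter.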
Motivation: Beyond Scaling Laws
Typical large‑model progress follows the “more parameters + more data + more compute” rule, which makes training increasingly expensive and complex. HRM‑Text asks whether, under limited data and compute, architectural redesign and a new training objective can increase the useful computation per FLOP.
Hierarchical Recursive Architecture
HRM‑Text introduces two modules that operate on different time scales: a high‑level module H and a low‑level module L. Each forward pass performs eight recursive updates (six of L and two of H) before emitting a token. This “multi‑round internal computation” raises the effective depth without adding parameters.
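A minimal sketch of such an update schedule follows. The source gives only the totals (six L updates, two H updates), so the interleaving used here (three L updates followed by one H update, repeated twice) is an assumption. The two update functions are hypothetical toy stand-ins for the real modules, which would be Transformer blocks whose weights are reused on every call.

```python
import math

def update_low(z_low, z_high, x):
    # Hypothetical fast update: mixes the low state, the high state, and the input.
    return [math.tanh(l + h + xi) for l, h, xi in zip(z_low, z_high, x)]

def update_high(z_high, z_low):
    # Hypothetical slow update: reads the current low-level state.
    return [math.tanh(h + l) for h, l in zip(z_high, z_low)]

def recursive_forward(x, h_cycles=2, l_steps_per_cycle=3):
    # Assumed interleaving: several L updates, then one H update, per cycle.
    z_high = [0.0] * len(x)
    z_low = [0.0] * len(x)
    trace = []
    for _ in range(h_cycles):
        for _ in range(l_steps_per_cycle):
            z_low = update_low(z_low, z_high, x)
            trace.append("L")
        z_high = update_high(z_high, z_low)
        trace.append("H")
    return z_high, trace

state, trace = recursive_forward([0.1, -0.2, 0.3])
print("".join(trace))  # LLLHLLLH
```

The same two functions run eight times in total, which is how recursion adds effective depth while the parameter count stays fixed.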
Unlike the common approach of stitching separate large and small models together, H and L belong to the same network and share a latent space; the information exchanged between them is learned jointly by a single optimizer.
Stability Mechanisms
To keep deep recursion stable, the authors add MagicNorm, an extra normalization step after each recursion that controls activation variance while preserving the PreNorm structure for gradient flow.
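The source does not give the exact form of MagicNorm, so the sketch below illustrates only the general idea. A PreNorm residual block lets the state's magnitude grow with every recursion, and an extra normalization applied after each step keeps it bounded. RMS normalization is used here as a stand-in, and the toy sublayer and all function names are assumptions.

```python
import math

def rms_norm(v, eps=1e-6):
    # Rescale v so that its root-mean-square value is approximately 1.
    rms = math.sqrt(sum(a * a for a in v) / len(v) + eps)
    return [a / rms for a in v]

def prenorm_block(v):
    # PreNorm residual: the sublayer sees a normalized input, but its output
    # is added onto the un-normalized state, so the state can keep growing.
    normed = rms_norm(v)
    return [a + 2.0 * b for a, b in zip(v, normed)]  # toy sublayer: scale by 2

def final_rms(steps, renormalize):
    v = [1.0, -2.0, 0.5, 3.0]
    for _ in range(steps):
        v = prenorm_block(v)
        if renormalize:
            v = rms_norm(v)  # stand-in for the extra post-recursion normalization
    return math.sqrt(sum(a * a for a in v) / len(v))

print(final_rms(8, renormalize=False))  # grows with the number of steps
print(final_rms(8, renormalize=True))   # stays close to 1
```

In this toy setting the residual path inside each block is untouched, which matches the description of preserving the PreNorm structure while bounding the activations between recursions.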
They also employ a “warmup deep credit assignment” schedule: early in training gradients are back‑propagated through only the last two recursion steps, and the back‑propagation depth is linearly increased to the last five steps as training stabilizes.
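The schedule described above can be sketched as a function from training step to back-propagation depth. The endpoints (two and five recursion steps) come from the text. The warmup length of 900 steps, the integer rounding, and the function names are assumptions made for illustration.

```python
def backprop_depth(step, warmup_steps, start_depth=2, end_depth=5):
    # Linearly increase the number of recursion steps that receive gradients,
    # from start_depth at step 0 to end_depth once warmup_steps is reached.
    if warmup_steps <= 0:
        return end_depth
    clamped = min(max(step, 0), warmup_steps)
    return start_depth + (end_depth - start_depth) * clamped // warmup_steps

def steps_with_grad(total_steps, depth):
    # Indices of the final recursion steps through which gradients flow;
    # earlier steps would be run without gradient tracking.
    return list(range(total_steps - depth, total_steps))

for step in (0, 300, 600, 900, 5000):
    depth = backprop_depth(step, warmup_steps=900)
    print(step, depth, steps_with_grad(8, depth))
```

With eight recursion steps per forward pass, this starts by back-propagating through steps 6 and 7 only and ends by back-propagating through steps 3 to 7.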
Focused Pre‑training Objective and PrefixLM Mask
Instead of full causal next‑token prediction, HRM‑Text trains on instruction‑answer pairs and computes loss only on the answer tokens. The instruction tokens are made bidirectionally visible, while answer tokens use the standard causal mask. This PrefixLM mask lets the model treat the instruction as an encoded context and the answer as a decoded sequence.
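The following sketch shows one way to build the two masks this paragraph describes: an attention mask that is bidirectional over the instruction and causal over the answer, and a loss mask that selects only answer tokens. The function names and the list-of-lists representation are illustrative assumptions. A real implementation would typically use tensors.

```python
def prefix_lm_mask(prefix_len, total_len):
    # mask[i][j] is True when position i may attend to position j.
    mask = []
    for i in range(total_len):
        row = []
        for j in range(total_len):
            both_in_prefix = i < prefix_len and j < prefix_len
            # Instruction tokens see the whole instruction; every token
            # also sees itself and everything before it (causal rule).
            row.append(both_in_prefix or j <= i)
        mask.append(row)
    return mask

def answer_loss_mask(prefix_len, total_len):
    # True for answer tokens, which are the only ones that contribute to the loss.
    return [pos >= prefix_len for pos in range(total_len)]

mask = prefix_lm_mask(prefix_len=3, total_len=5)
for row in mask:
    print(" ".join("1" if allowed else "0" for allowed in row))
print(answer_loss_mask(prefix_len=3, total_len=5))
```

For a three-token instruction followed by a two-token answer, the first three rows are identical (full visibility within the instruction), and the last two rows follow the usual lower-triangular causal pattern.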
Ablation Results
On ARC‑Challenge, a vanilla 1 B Transformer scores 51.91. Adding the answer‑only loss raises this to 62.88, adding the PrefixLM mask raises it further to 74.32, and switching to the full HRM architecture reaches 81.91. Similar stepwise improvements are observed on MATH (35.44 → 47.04 → 48.36 → 56.16) and GSM8K (48.37 → 69.75 → 75.06 → 84.53). The authors attribute the gains to the combination of hierarchical recursion, concentrated training signal, and the PrefixLM mask.
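To make the size of each step explicit, the short script below computes the gain contributed by each successive change, using only the scores quoted above. The step labels and variable names are illustrative.

```python
# Scores in ablation order: vanilla, + answer-only loss, + PrefixLM, + full HRM.
ABLATION = {
    "ARC-Challenge": [51.91, 62.88, 74.32, 81.91],
    "MATH": [35.44, 47.04, 48.36, 56.16],
    "GSM8K": [48.37, 69.75, 75.06, 84.53],
}
STEPS = ["+ answer-only loss", "+ PrefixLM", "+ HRM architecture"]

def stepwise_gains(scores):
    # Difference between each configuration and the one before it.
    return [round(b - a, 2) for a, b in zip(scores, scores[1:])]

gains = {name: stepwise_gains(scores) for name, scores in ABLATION.items()}
for name, per_step in gains.items():
    print(name, dict(zip(STEPS, per_step)))
```

On these numbers the answer-only loss is the largest single step on MATH and GSM8K, while PrefixLM is the largest on ARC‑Challenge. Because the changes are applied cumulatively, these differences depend on the order in which they were added.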
Data Cleanliness and Generalization
The training data are publicly available and traceable. The team ran strict data‑contamination checks (clean‑split evaluation) and confirmed that the performance advantage persists, indicating that the gains stem from the architecture rather than test‑set leakage.
Comparison with Peer Models
HRM‑Text outperforms several 2B‑parameter models on reasoning‑heavy benchmarks (MATH, GSM8K, DROP, ARC‑Challenge) but trails larger models on knowledge‑intensive tasks such as MMLU (HRM‑Text 60.7 vs. Qwen‑3.5 2B 64.5).
Limitations and Future Directions
With limited data and parameters, the model cannot cover the full breadth of factual knowledge, making it more suitable for task‑execution and reasoning than for a general‑purpose chatbot. The authors propose decoupling the reasoning core from a knowledge store, allowing retrieval or external memory modules to supply facts. Ongoing work (e.g., GRAM by Yoshua Bengio) builds on HRM‑Text’s hierarchical recursion.
Broader Context
HRM‑Text follows earlier HRM‑Symbolic work that applied the same hierarchical recursion to symbolic tasks (e.g., Sudoku, maze solving). The current model demonstrates that the approach can also work for open‑domain language modeling, though scaling to larger sizes, mixture‑of‑experts, or integration with retrieval remains an open research problem.
