How HRM-Text Reaches Competitive Benchmarks with 1B Parameters and About $1,500 of Training

HRM-Text is a 1‑billion‑parameter model trained in under two days on 16 H100 GPUs for about $1,500. It combines a hierarchical recursive architecture, an answer‑only loss, and a PrefixLM mask to reach competitive scores on MATH, GSM8K, and ARC‑Challenge, offering an efficient alternative to scaling‑only approaches.

Machine Heart

Jun 9, 2026

Efficient 1B‑Parameter Model with Strong Benchmarks

HRM-Text (≈1B parameters) scores 56.2 on MATH, 84.5 on GSM8K, and 81.9 on ARC‑Challenge after training on roughly 40B unique tokens (≈60B total tokens). Training took less than two days on 16 H100 GPUs and cost about $1,500. The team released the paper, model weights, and pre‑training code.
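The headline figures imply a rough budget that is easy to check by hand. The short calculation below uses only the numbers quoted above and treats "less than two days" as a 48-hour upper bound, so the GPU-hour figure is a ceiling and the hourly price a floor. The variable names are illustrative.

```python
# Figures quoted in the article.
NUM_GPUS = 16
MAX_HOURS = 48            # "less than two days", used here as an upper bound
TOTAL_COST_USD = 1_500    # "about $1,500"
UNIQUE_TOKENS = 40e9      # roughly 40B unique tokens
TOTAL_TOKENS = 60e9       # roughly 60B tokens seen in total

gpu_hours = NUM_GPUS * MAX_HOURS                 # upper bound on GPU-hours
usd_per_gpu_hour = TOTAL_COST_USD / gpu_hours    # implied lower bound on the hourly rate
passes_over_data = TOTAL_TOKENS / UNIQUE_TOKENS  # average times each unique token is seen

print(f"GPU-hours (upper bound): {gpu_hours}")                     # 768
print(f"Implied price per GPU-hour: ${usd_per_gpu_hour:.2f}")      # $1.95
print(f"Average passes over unique data: {passes_over_data:.1f}")  # 1.5
```

In other words, the run fits in at most 768 H100-hours, and each unique token is seen about 1.5 times on average.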

Motivation: Beyond Scaling Laws

Typical large‑model progress follows the “more parameters + more data + more compute” rule, which makes training increasingly expensive and complex. HRM‑Text asks whether, under limited data and compute, architectural redesign and a new training objective can increase the useful computation per FLOP.

Hierarchical Recursive Architecture

HRM‑Text introduces two modules that operate on different time scales: a high‑level module H and a low‑level module L. During each forward pass the model performs eight recursive updates (six of L and two of H) before emitting a token. This “multi‑round internal computation” raises the effective depth without adding parameters.
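A toy sketch can make the update pattern concrete. The article gives the counts (six L updates, two H updates) but not their ordering, so the code below assumes two cycles of three L updates followed by one H update. The functions `l_update` and `h_update` are hypothetical scalar stand-ins for the real modules, and the fixed coefficients play the role of shared weights.

```python
import math

def l_update(z_l, z_h, x):
    """Toy low-level update: mixes its own state, the high-level state, and the input."""
    return [math.tanh(0.5 * a + 0.3 * b + 0.2 * c) for a, b, c in zip(z_l, z_h, x)]

def h_update(z_h, z_l):
    """Toy high-level update: mixes its own state with the low-level state."""
    return [math.tanh(0.6 * a + 0.4 * b) for a, b in zip(z_h, z_l)]

def hrm_forward(x, h_cycles=2, l_steps_per_cycle=3):
    """Run the nested recursion; return the final H state and the update order."""
    z_h = [0.0] * len(x)
    z_l = [0.0] * len(x)
    order = []
    for _ in range(h_cycles):
        for _ in range(l_steps_per_cycle):
            z_l = l_update(z_l, z_h, x)   # fast time scale
            order.append("L")
        z_h = h_update(z_h, z_l)          # slow time scale
        order.append("H")
    return z_h, order

state, order = hrm_forward([0.1, -0.4, 0.7])
print("".join(order))                      # LLLHLLLH
print(order.count("L"), order.count("H"))  # 6 2
```

Because the same two functions are reused at every step, the eight updates add computation but no new parameters, which is the sense in which recursion raises effective depth.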

Unlike the common “big brain, small brain” approach that stitches separate models together, H and L belong to the same network and share a latent space; the information exchanged between them is learned jointly under a single optimizer.

Stability Mechanisms

To keep deep recursion stable, the authors add MagicNorm, an extra normalization step after each recursion that controls activation variance while preserving the PreNorm structure for gradient flow.
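The article does not give MagicNorm's formula, so the sketch below only illustrates the general idea, with plain RMS normalization as an assumed stand-in. A PreNorm residual block leaves the skip path unnormalized, so the state's magnitude can grow with every recursion. An extra normalization after each step keeps it bounded. The `sublayer` function is a hypothetical placeholder for attention or MLP layers.

```python
import math

def rms(x):
    """Root-mean-square magnitude of a vector."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def rms_norm(x, eps=1e-6):
    """Scale x to (approximately) unit RMS."""
    scale = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / scale for v in x]

def sublayer(x):
    """Hypothetical stand-in for an attention/MLP sublayer."""
    return [1.5 * v for v in x]

def prenorm_block(x):
    """PreNorm residual: x + f(norm(x)); the skip path itself is not normalized."""
    return [a + b for a, b in zip(x, sublayer(rms_norm(x)))]

def recurse(x, steps, extra_norm):
    """Apply the same block repeatedly and record the RMS after each step."""
    history = []
    for _ in range(steps):
        x = prenorm_block(x)
        if extra_norm:
            x = rms_norm(x)  # assumed stand-in for the extra per-recursion norm
        history.append(rms(x))
    return history

x0 = [1.0, -2.0, 0.5, 3.0]
plain = recurse(x0, steps=8, extra_norm=False)
normed = recurse(x0, steps=8, extra_norm=True)
print([round(r, 2) for r in plain])   # grows step by step
print([round(r, 2) for r in normed])  # stays near 1.0
```

The block keeps its PreNorm form in both runs; only the state handed to the next recursion is rescaled.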

They also employ a “warmup deep credit assignment” schedule: early in training, gradients are back‑propagated through only the last two recursion steps, and the back‑propagation depth is increased linearly to the last five steps as training stabilizes.
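One way to picture this schedule is as a function from training step to back-propagation depth. The sketch below assumes a linear ramp between two hypothetical step counts, `ramp_start` and `ramp_end`, which the article does not specify. Only the endpoints (two and five recursion steps) come from the text.

```python
def backprop_depth(step, ramp_start, ramp_end, min_depth=2, max_depth=5):
    """Number of trailing recursion steps that receive gradients at a training step."""
    if step <= ramp_start:
        return min_depth
    if step >= ramp_end:
        return max_depth
    # Integer arithmetic keeps the linear ramp free of float rounding.
    span = max_depth - min_depth
    return min_depth + (step - ramp_start) * span // (ramp_end - ramp_start)

def gradient_flags(total_steps, depth):
    """True for recursion steps kept in the backward pass, False for earlier ones."""
    return [i >= total_steps - depth for i in range(total_steps)]

# ramp_start and ramp_end are hypothetical values chosen for illustration.
for step in (0, 1000, 2000, 3000, 4000, 10000):
    depth = backprop_depth(step, ramp_start=1000, ramp_end=4000)
    print(step, depth, gradient_flags(8, depth))
# depths printed: 2, 2, 3, 4, 5, 5
```

In a real training loop, the steps flagged `False` would typically run without gradient tracking, so early training pays for a shallow backward pass while the forward pass still uses all eight updates.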

Focused Pre‑training Objective and PrefixLM Mask

Instead of full causal next‑token prediction, HRM‑Text trains on instruction‑answer pairs and computes loss only on the answer tokens. The instruction tokens are made bidirectionally visible, while answer tokens use the standard causal mask. This PrefixLM mask lets the model treat the instruction as an encoded context and the answer as a decoded sequence.
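The mask and the loss restriction can both be written in a few lines. The sketch below is a minimal pure-Python illustration rather than the released training code. `prefix_len` is the number of instruction tokens, and `token_nll[t]` is assumed to hold the negative log-likelihood of predicting token `t`.

```python
def prefix_lm_mask(prefix_len, total_len):
    """mask[i][j] is True when position i may attend to position j."""
    return [
        [(j < prefix_len) or (j <= i) for j in range(total_len)]
        for i in range(total_len)
    ]

def answer_only_loss(token_nll, prefix_len):
    """Average the per-token loss over answer tokens only."""
    answer_nll = token_nll[prefix_len:]
    return sum(answer_nll) / len(answer_nll)

# Three instruction tokens followed by two answer tokens.
mask = prefix_lm_mask(prefix_len=3, total_len=5)
for row in mask:
    print(" ".join("1" if allowed else "0" for allowed in row))
# 1 1 1 0 0
# 1 1 1 0 0
# 1 1 1 0 0
# 1 1 1 1 0
# 1 1 1 1 1

token_nll = [2.0, 1.5, 3.0, 0.5, 1.0]          # made-up per-token losses
print(answer_only_loss(token_nll, 3))          # 0.75
print(sum(token_nll) / len(token_nll))         # 1.6 (full next-token average)
```

The top-left 3×3 block is fully visible (bidirectional instruction), the answer rows are lower-triangular (causal), and instruction positions never see answer tokens.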

Ablation Results

On ARC‑Challenge, a vanilla 1 B Transformer scores 51.91. Adding answer‑only loss raises it to 62.88; adding PrefixLM further to 74.32; and using the full HRM architecture reaches 81.91. Similar stepwise improvements are observed on MATH (35.44 → 47.04 → 48.36 → 56.16) and GSM8K (48.37 → 69.75 → 75.06 → 84.53). The authors attribute the gains to the combination of hierarchical recursion, concentrated training signal, and the PrefixLM mask.
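To see how much each stage contributes, the reported scores can be turned into per-step gains. The snippet below does only that arithmetic on the numbers quoted above; the stage labels are shorthand for the three additions.

```python
# Scores in order: vanilla 1B Transformer, + answer-only loss, + PrefixLM, + full HRM.
ablation = {
    "ARC-Challenge": [51.91, 62.88, 74.32, 81.91],
    "MATH": [35.44, 47.04, 48.36, 56.16],
    "GSM8K": [48.37, 69.75, 75.06, 84.53],
}
stages = ["answer-only loss", "PrefixLM", "HRM architecture"]

gains = {}
totals = {}
for bench, scores in ablation.items():
    # Difference between each stage and the one before it.
    gains[bench] = [round(b - a, 2) for a, b in zip(scores, scores[1:])]
    totals[bench] = round(scores[-1] - scores[0], 2)
    steps = ", ".join(f"{s}: +{g:.2f}" for s, g in zip(stages, gains[bench]))
    print(f"{bench}: {steps} (total +{totals[bench]:.2f})")
# ARC-Challenge: answer-only loss: +10.97, PrefixLM: +11.44, HRM architecture: +7.59 (total +30.00)
# MATH: answer-only loss: +11.60, PrefixLM: +1.32, HRM architecture: +7.80 (total +20.72)
# GSM8K: answer-only loss: +21.38, PrefixLM: +5.31, HRM architecture: +9.47 (total +36.16)
```

The gains are uneven across benchmarks: PrefixLM adds the most on ARC‑Challenge and little on MATH, while the HRM architecture adds between about 7.6 and 9.5 points on all three.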

Data Cleanliness and Generalization

The training data are publicly available and traceable. The team ran strict data‑contamination checks (clean‑split evaluation) and confirmed that the performance advantage persists, indicating that the gains stem from the architecture rather than test‑set leakage.

Comparison with Peer Models

HRM‑Text outperforms several 2B‑parameter models on reasoning‑heavy benchmarks (MATH, GSM8K, DROP, ARC‑Challenge) but trails larger models on knowledge‑intensive tasks such as MMLU (HRM‑Text 60.7 vs. Qwen‑3.5 2B 64.5).

Limitations and Future Directions

With limited data and parameters, the model cannot cover the full breadth of factual knowledge, making it more suitable for task‑execution and reasoning than for a general‑purpose chatbot. The authors propose decoupling the reasoning core from a knowledge store, allowing retrieval or external memory modules to supply facts. Ongoing work (e.g., GRAM by Yoshua Bengio) builds on HRM‑Text’s hierarchical recursion.

Broader Context

HRM‑Text follows earlier HRM‑Symbolic work that applied the same hierarchical recursion to symbolic tasks (e.g., Sudoku, maze solving). The current model demonstrates that the approach can also work for open‑domain language modeling, though scaling to larger sizes, mixture‑of‑experts, or integration with retrieval remains an open research problem.
