How HRM-Text-1B Beats Scaling Laws with 0.1% Data and Hundreds‑Fold Compute Savings

HRM-Text-1B, a brain‑inspired hierarchical language model, achieves strong benchmark scores while using only 0.1% of the training tokens of comparable models, cutting compute costs by 96‑432× through a novel H/L module architecture, MagicNorm stabilization, and a focused instruction‑response training objective.

SuanNi

Background

The original HRM, with 27 M parameters and 1 000 training samples, reportedly outperformed OpenAI o3‑mini‑high and DeepSeek R1 on the ARC‑AGI‑2 benchmark. HRM‑Text applies the same hierarchical idea to language modeling.

Hierarchical Architecture

HRM‑Text replaces the standard Transformer with two modules: an H module (the slow, strategic layer) and an L module (the fast, execution layer). In the forward pass, token embeddings produce an initial high‑level state, and two H‑loops are then executed. Each loop runs three L‑module updates followed by one H‑module update, for a total of 2 × (3 + 1) = 8 H/L steps. Parameter sharing keeps the total parameter count at 1 B.
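The update order described above can be sketched in a few lines of Python. This is a minimal illustration rather than the released implementation: the function names, the zero initial low‑level state, and the toy update rules are all assumptions made for the example. Only the schedule (two H‑loops, three L updates then one H update per loop) comes from the text.

```python
from typing import Callable, List, Tuple

State = List[float]


def hrm_forward(
    embedding: State,
    l_update: Callable[[State, State], State],
    h_update: Callable[[State, State], State],
    h_loops: int = 2,
    l_steps_per_loop: int = 3,
) -> Tuple[State, State, List[str]]:
    """Run the nested H/L schedule and record the order of updates."""
    z_h = list(embedding)          # initial high-level state from the embeddings
    z_l = [0.0] * len(embedding)   # assumption: low-level state starts at zero
    trace: List[str] = []
    for _ in range(h_loops):
        for _ in range(l_steps_per_loop):
            z_l = l_update(z_l, z_h)   # fast execution step, conditioned on z_h
            trace.append("L")
        z_h = h_update(z_h, z_l)       # slow strategic step, reads the L result
        trace.append("H")
    return z_h, z_l, trace


# Toy stand-ins for the real modules (hypothetical, for illustration only).
def toy_l(z_l: State, z_h: State) -> State:
    return [0.5 * a + 0.5 * b for a, b in zip(z_l, z_h)]


def toy_h(z_h: State, z_l: State) -> State:
    return [a + b for a, b in zip(z_h, z_l)]


z_h, z_l, trace = hrm_forward([1.0, 2.0], toy_l, toy_h)
print("".join(trace))  # LLLHLLLH
```

The same two callables are reused in every loop, which is one way to picture how reusing weights lets the step count grow without adding parameters.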

Effective Depth

Layer‑wise difference analysis and logit‑lens KL‑divergence show that every layer continues to produce meaningful representation changes, avoiding the representation‑convergence problem observed in deep standard Transformers.
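To make the logit‑lens measurement concrete, the sketch below computes the KL divergence between the next‑token distributions decoded from consecutive layers. It assumes the per‑layer logits have already been obtained by projecting each layer's hidden state through the output head. The direction of the KL term and the helper names are choices made for this example, not details taken from the paper.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two discrete distributions with matching support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def layerwise_kl(per_layer_logits: List[List[float]]) -> List[float]:
    """KL between each layer's decoded distribution and the previous layer's."""
    probs = [softmax(layer) for layer in per_layer_logits]
    return [kl_divergence(probs[i + 1], probs[i]) for i in range(len(probs) - 1)]


# Toy logits: layer 1 changes the prediction, layer 2 repeats layer 1 exactly.
kls = layerwise_kl([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
print(kls)
```

A layer that no longer changes the decoded distribution shows up as a KL value near zero, which is the convergence pattern the analysis is looking for.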

Gradient‑Stability Techniques

Two mechanisms address gradient explosion/vanishing:

MagicNorm: each loop contains L PreNorm blocks and a final normalization layer, combining the forward‑pass stability of PostNorm with the backward‑pass stability of PreNorm.

Warmup deep credit assignment: early in training, gradients are back‑propagated through only the last two loop steps; this window expands linearly to five steps as training progresses, reducing early computational load and stabilizing learning.
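Both mechanisms can be illustrated with a small sketch. Several things here are assumptions and not details from the source: RMS normalization as the norm, the `warmup_steps` parameter that controls how quickly the window grows, and the function names. The code shows only the structure (PreNorm residual blocks followed by one final norm) and the 2‑to‑5 step schedule, not a trainable model.

```python
import math
from typing import Callable, List

Vector = List[float]


def rms_norm(x: Vector, eps: float = 1e-6) -> Vector:
    """Scale x to roughly unit root-mean-square (assumed norm type)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def magicnorm_loop(x: Vector, sublayers: List[Callable[[Vector], Vector]]) -> Vector:
    """PreNorm residual blocks inside the loop, one final norm at the end."""
    for f in sublayers:
        update = f(rms_norm(x))                   # PreNorm: normalize the sublayer input
        x = [a + b for a, b in zip(x, update)]    # residual path stays un-normalized
    return rms_norm(x)                            # final norm bounds the loop output


def backprop_depth(step: int, warmup_steps: int, start: int = 2, end: int = 5) -> int:
    """Number of trailing loop steps that receive gradients at a training step."""
    if step >= warmup_steps:
        return end
    # Integer arithmetic gives a linear ramp from start to end without rounding drift.
    return start + (end - start) * step // warmup_steps


out = magicnorm_loop([3.0, 4.0], [lambda v: [2.0 * t for t in v]] * 4)
depths = [backprop_depth(s, warmup_steps=900) for s in (0, 300, 600, 900)]
print(depths)  # [2, 3, 4, 5]
```

In a real training loop the steps outside the window would run without gradient tracking, so only the last `backprop_depth(...)` steps contribute to the backward pass.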

Training Objective

Training uses only instruction‑response pairs, optimizing the negative log‑likelihood of the response. A PrefixLM attention mask applies bidirectional attention to the instruction segment and causal masking to the response segment, increasing signal density compared with full auto‑regressive pre‑training.
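The masking scheme can be sketched as follows. The helper names and the choice to average the loss over response tokens are assumptions for illustration. The attention rule itself (bidirectional over the instruction, causal over the response) and the response‑only loss follow the description above.

```python
from typing import List


def prefix_lm_mask(prefix_len: int, total_len: int) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    mask = []
    for q in range(total_len):
        row = []
        for k in range(total_len):
            in_prefix = k < prefix_len   # instruction tokens are visible to everyone
            causal = k <= q              # otherwise only current and earlier tokens
            row.append(in_prefix or causal)
        mask.append(row)
    return mask


def response_nll(token_log_probs: List[float], prefix_len: int) -> float:
    """Mean negative log-likelihood over response tokens only (assumed reduction)."""
    response = token_log_probs[prefix_len:]
    return -sum(response) / len(response)


mask = prefix_lm_mask(prefix_len=2, total_len=4)
rows = ["".join("1" if allowed else "0" for allowed in row) for row in mask]
print(rows)  # ['1100', '1100', '1110', '1111']

loss = response_nll([-9.0, -9.0, -1.0, -3.0], prefix_len=2)
print(loss)  # 2.0
```

The first two rows show the instruction tokens attending to each other in both directions but never to the response. The instruction log‑probabilities (here the two `-9.0` entries) are ignored by the loss.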

Training Cost and Data Efficiency

The 1 B model was trained on 400 billion tokens (40 billion unique) for 600 billion token steps on two 8×H100 nodes, completing in 46 hours at a cost of ≈ $1 472. For comparison, Qwen 3.5 2B used 36 trillion tokens (432× compute), Llama 3.2 3B used 9 trillion tokens (162×), and Gemma 3 4B used 4 trillion tokens (96×).
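The reported figures can be cross‑checked with simple arithmetic. The sketch below assumes that training compute scales with parameters × tokens and that the baseline parameter counts are the ones in the model names (2 B, 3 B, 4 B). Neither assumption is stated in the text, so treat the result as a consistency check, not a reproduction of the authors' calculation.

```python
from fractions import Fraction

# Hardware cost: two nodes of 8 GPUs each, 46 hours, $1,472 total.
gpu_hours = 2 * 8 * 46
usd_per_gpu_hour = Fraction(1472, gpu_hours)
print(gpu_hours, usd_per_gpu_hour)  # 736 2

# (parameters in billions, tokens in trillions, reported compute ratio)
baselines = {
    "Qwen 3.5 2B": (2, 36, 432),
    "Llama 3.2 3B": (3, 9, 162),
    "Gemma 3 4B": (4, 4, 96),
}

# Baseline compute divided by the reported ratio, in units of 1e21 parameter-tokens.
implied = {
    name: Fraction(params * tokens, ratio)
    for name, (params, tokens, ratio) in baselines.items()
}
for name, value in implied.items():
    print(name, value)  # each line ends in 1/6

# Unique-token share relative to the largest baseline: 40e9 / 36e12.
unique_share = Fraction(40, 36_000)
print(f"{float(unique_share):.2%}")  # 0.11%
```

Under these assumptions the three ratios all point to the same reference compute (1/6 × 10²¹ parameter‑tokens), the cost works out to $2 per GPU‑hour, and 40 billion unique tokens is about 0.1% of 36 trillion.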

Benchmark Performance

HRM‑Text achieves 60.7 % on MMLU, 84.5 % on GSM8K (9 points above OLMo 7B), 56.2 % on MATH (best among compared models), 82.2 % on DROP, and 81.9 % on ARC‑C (second only to OLMo). The authors note limited factual knowledge coverage reflected in the MMLU score.

Future Directions

The authors propose decoupling reasoning from knowledge by pairing the compact hierarchical core with external retrieval or learned memory modules such as Engram, with the aim of boosting factual coverage without sacrificing efficiency.

Open‑Source Release

Code and model weights are available at https://github.com/sapientinc/HRM-Text and https://huggingface.co/sapientinc/HRM-Text-1B. The arXiv pre‑print is https://arxiv.org/pdf/2605.20613.

