How HRM-Text-1B Beats Scaling Laws with 0.1% Data and Hundreds‑Fold Compute Savings
HRM-Text-1B, a brain-inspired hierarchical language model, achieves strong benchmark scores while training on roughly 0.1% of the tokens used by comparable models. The reported compute savings of 96× to 432× come from a novel H/L module architecture, MagicNorm stabilization, and a focused instruction-response training objective.
Background
The original HRM, with 27M parameters and 1,000 training samples, outperformed OpenAI o3-mini-high and DeepSeek R1 on the ARC-AGI-2 benchmark.
Hierarchical Architecture
HRM‑Text replaces the standard Transformer with two modules: an H module (slow strategic layer) and an L module (fast execution layer). In the forward pass, token embeddings produce an initial high‑level state, then two H‑loops are executed; each loop runs three L‑module updates followed by one H‑module update, totaling eight H/L steps. Parameter sharing keeps the total parameter count at 1 B.
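The loop structure described above can be sketched in a few lines. This is a minimal illustration, not the released implementation: the function names, the update signatures `l_update(z_l, z_h)` and `h_update(z_h, z_l)`, and the zero initialization of the low-level state are all assumptions made for the example.

```python
def hrm_step_schedule(h_loops=2, l_steps_per_loop=3):
    """Return the ordered list of module updates in one forward pass."""
    schedule = []
    for _ in range(h_loops):
        schedule.extend(["L"] * l_steps_per_loop)  # fast execution updates
        schedule.append("H")                       # one slow strategic update
    return schedule


def hrm_forward(x, l_update, h_update, h_loops=2, l_steps_per_loop=3):
    """Run the nested H/L recurrence with caller-supplied update functions.

    The same l_update and h_update are reused at every step, which is what
    keeps the parameter count fixed while the number of steps grows.
    """
    z_h = x        # initial high-level state comes from the token embeddings
    z_l = 0.0 * x  # assumption: low-level state starts at zero
    for _ in range(h_loops):
        for _ in range(l_steps_per_loop):
            z_l = l_update(z_l, z_h)   # L module conditions on the current H state
        z_h = h_update(z_h, z_l)       # H module absorbs the result of the L steps
    return z_h


print(hrm_step_schedule())  # ['L', 'L', 'L', 'H', 'L', 'L', 'L', 'H']
```

With the default arguments the schedule has 2 × (3 + 1) = 8 entries, matching the eight H/L steps stated above.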
Effective Depth
Layer‑wise difference analysis and logit‑lens KL‑divergence show that every layer continues to produce meaningful representation changes, avoiding the representation‑convergence problem observed in deep standard Transformers.
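As a rough illustration of the two diagnostics named above, the NumPy sketch below computes a per-layer relative change and a logit-lens KL divergence for a single token position. The function names, the choice of KL(final ‖ layer) as the direction, and the omission of any final normalization before the unembedding are assumptions for the example; the source does not specify these details.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def layerwise_relative_change(hidden_states):
    """hidden_states: (num_layers, d_model). Returns ||h[i+1] - h[i]|| / ||h[i]||."""
    diffs = np.linalg.norm(hidden_states[1:] - hidden_states[:-1], axis=-1)
    norms = np.linalg.norm(hidden_states[:-1], axis=-1)
    return diffs / norms


def logit_lens_kl(hidden_states, unembed):
    """Project every layer through the unembedding and compare with the last layer.

    hidden_states: (num_layers, d_model), unembed: (d_model, vocab).
    Returns KL(final || layer) for each layer; the last entry is zero.
    """
    log_probs = log_softmax(hidden_states @ unembed)  # (num_layers, vocab)
    final = log_probs[-1]
    p_final = np.exp(final)
    return (p_final * (final - log_probs)).sum(axis=-1)
```

If later layers had stopped contributing, the relative change would drop toward zero and the KL curve would flatten well before the last layer.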
Gradient‑Stability Techniques
Two mechanisms address gradient explosion/vanishing:
MagicNorm: each loop contains L PreNorm blocks and a final normalization layer, combining forward‑pass stability of PostNorm with backward‑pass stability of PreNorm.
Warmup deep credit assignment: early training back‑propagates gradients through only the last two loop steps, linearly expanding to five steps as training progresses, reducing early computational load and stabilizing learning.
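Both mechanisms can be sketched briefly. The code below is an illustration under stated assumptions, not the authors' implementation: RMS normalization without a learned gain stands in for whatever normalization the model uses, the sub-layers are arbitrary callables, and `warmup_steps` is a hypothetical hyperparameter whose real value is not given here.

```python
import numpy as np


def rms_norm(x, eps=1e-6):
    """RMS normalization over the last axis (assumed normalization type)."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)


def magicnorm_loop(x, blocks):
    """One loop: a stack of PreNorm residual blocks, then one final normalization."""
    for f in blocks:
        x = x + f(rms_norm(x))  # PreNorm: the residual path stays an identity
    return rms_norm(x)          # final norm: output scale stays bounded across loops


def backprop_steps(step, warmup_steps, start=2, end=5):
    """Number of trailing loop steps that receive gradients at a training step.

    Grows linearly from `start` to `end` over `warmup_steps`, then stays at `end`.
    """
    frac = min(max(step / warmup_steps, 0.0), 1.0)
    return start + int(frac * (end - start))
```

For example, with `warmup_steps=1000` the schedule returns 2 at step 0, 3 at step 500, and 5 from step 1000 onward.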
Training Objective
Training uses only instruction‑response pairs, optimizing the negative log‑likelihood of the response. A PrefixLM attention mask applies bidirectional attention to the instruction segment and causal masking to the response segment, increasing signal density compared with full auto‑regressive pre‑training.
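A PrefixLM mask and a response-only loss can be built as follows. This is a minimal NumPy sketch; the function names, the boolean mask convention (True means "may attend"), and the use of a mean rather than a sum over response tokens are assumptions for the example.

```python
import numpy as np


def prefix_lm_mask(seq_len, prefix_len):
    """mask[i, j] is True when query position i may attend to key position j.

    The first `prefix_len` positions (the instruction) see each other in both
    directions; the remaining positions (the response) are causal.
    """
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))  # causal base
    mask[:, :prefix_len] = True  # every position sees the whole instruction
    return mask


def response_nll(token_log_probs, prefix_len):
    """Average negative log-likelihood over response tokens only.

    token_log_probs: (seq_len,) log-probability the model gave each target token.
    Instruction positions are excluded from the loss.
    """
    return -token_log_probs[prefix_len:].mean()


print(prefix_lm_mask(5, 2).astype(int))
```

Instruction rows never attend to response columns, so the response cannot leak into the instruction's representation.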
Training Cost and Data Efficiency
The 1 B model was trained on 400 billion tokens (40 billion unique) for 600 billion token steps on two 8×H100 nodes, completing in 46 hours at a cost of ≈ $1 472. For comparison, Qwen 3.5 2B used 36 trillion tokens (432× compute), Llama 3.2 3B used 9 trillion tokens (162×), and Gemma 3 4B used 4 trillion tokens (96×).
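The figures above can be cross-checked with simple arithmetic. The sketch below assumes that training compute scales with parameters × tokens and takes the baseline parameter counts from the model names (2B, 3B, 4B); both are assumptions for this check, not statements from the source.

```python
# Hardware cost: two nodes of 8 H100s each, running for 46 hours.
gpu_hours = 2 * 8 * 46
usd_per_gpu_hour = 1472 / gpu_hours

# Baselines: (parameters, training tokens, reported compute ratio vs. HRM-Text).
baselines = {
    "Qwen 3.5 2B": (2e9, 36e12, 432),
    "Llama 3.2 3B": (3e9, 9e12, 162),
    "Gemma 3 4B": (4e9, 4e12, 96),
}

# Under compute ~ params * tokens, each ratio implies a reference
# params * tokens product for HRM-Text.
implied = {
    name: params * tokens / ratio
    for name, (params, tokens, ratio) in baselines.items()
}

# Data efficiency: unique HRM-Text tokens relative to the largest baseline corpus.
unique_fraction = 40e9 / 36e12

print(f"GPU-hours: {gpu_hours}, USD per GPU-hour: {usd_per_gpu_hour:.2f}")
for name, product in implied.items():
    print(f"{name}: implied HRM params*tokens = {product:.3e}")
print(f"unique-token fraction vs. 36T: {unique_fraction:.2%}")
```

The run works out to 736 GPU-hours at $2.00 per GPU-hour, and all three ratios imply the same reference product of about 1.67 × 10^20, so they are mutually consistent under this assumption. The 40 billion unique tokens are about 0.11% of 36 trillion, which matches the 0.1% headline figure.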
Benchmark Performance
HRM‑Text achieves 60.7 % on MMLU, 84.5 % on GSM8K (9 points above OLMo 7B), 56.2 % on MATH (best among compared models), 82.2 % on DROP, and 81.9 % on ARC‑C (second only to OLMo). The authors note limited factual knowledge coverage reflected in the MMLU score.
Future Directions
The authors propose decoupling reasoning from knowledge by pairing the compact hierarchical core with external retrieval or learned memory modules such as Engram, aiming to boost factual coverage without sacrificing efficiency.
Open‑Source Release
Code and model weights are available at https://github.com/sapientinc/HRM-Text and https://huggingface.co/sapientinc/HRM-Text-1B. The arXiv pre‑print is https://arxiv.org/pdf/2605.20613.
SuanNi
A community for AI developers that aggregates large-model development services, models, and compute power.
