Why Redesign the Training Stack? Inside Olmo‑Thinking’s Open‑Source RL Journey

This article provides a detailed technical analysis of the Olmo‑Thinking project, covering why a new open‑source LLM was built, the challenges of reinforcement learning at scale, data‑mix optimization, architectural bottlenecks such as missing GQA and QK‑Norm, and the post‑training techniques used to improve reasoning and long‑context capabilities.


Nathan Lambert, a researcher at AI2, explains the motivation behind creating the open‑source Olmo‑Thinking large language model, emphasizing the need for a fully transparent training stack that lets the community see whether performance gains come from genuine algorithmic improvements or from exploiting spurious correlations in the data.

The talk highlights two key observations: (1) Qwen‑2.5 and Qwen‑3 have become popular base models for RL research, raising the question of whether their strong RL benchmark scores stem from memorized pre‑training data or from true learning signals; (2) experiments with random rewards show that apparent improvements can arise from data bias rather than effective RL.

A concrete example demonstrates that Qwen’s base model can solve a math problem it has likely seen during pre‑training, suggesting that RL fine‑tuning may simply be triggering memorized answers instead of teaching genuine reasoning.

Scalability challenges are discussed, notably the lack of GQA (grouped-query attention) in the original Olmo-2 architecture, which causes KV-cache memory during RL rollouts to balloon to the level of a 32B model despite the model having only 7B parameters. This inefficiency forced a redesign of the training stack and a full re-pre-training of the model.
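To make the cost concrete, here is a back-of-the-envelope KV-cache calculation (a sketch with illustrative dimensions, not Olmo-2's exact configuration) showing how grouped-query attention shrinks rollout memory:

```python
# Rough KV-cache sizing: 2 tensors (K and V) per layer, stored for every
# KV head, position, and batch element, in bf16 (2 bytes).
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

seq_len, batch = 16_384, 64  # long RL rollouts at a realistic batch size
mha = kv_cache_bytes(32, 32, 128, seq_len, batch)  # full multi-head attention
gqa = kv_cache_bytes(32, 8, 128, seq_len, batch)   # GQA with 8 KV heads

print(f"MHA: {mha / 2**30:.0f} GiB")  # 512 GiB across the batch
print(f"GQA: {gqa / 2**30:.0f} GiB")  # 128 GiB, a 4x reduction
```

Without GQA, every query head keeps its own K/V cache, which is why a 7B model can demand the rollout memory of a much larger one.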

Data selection is tackled with a systematic RegMix approach (arXiv:2407.01492). The method trains many small proxy models on different data mixtures, evaluates them on a fixed validation set, fits a LightGBM regression model linking mix ratios to performance metrics (e.g., paloma/c4_en/bpb, lm_eval/averages/macro_avg_acc_norm, lm_eval/mmlu_5shot/choice_logprob_norm), and then searches over the fitted model for the best-predicted mixture. Increasing the proportion of math and code data improves those tasks but slightly hurts broader knowledge benchmarks, illustrating an unavoidable trade-off.
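The pipeline is easy to sketch. Below is a minimal RegMix-style loop, assuming LightGBM is available; the domain names and the synthetic metric are illustrative stand-ins for results from real small-model runs:

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
domains = ["web", "math", "code", "papers", "wiki"]

# X: mixture ratios used in many small proxy runs (points on the simplex).
# y: the measured validation metric, e.g. paloma/c4_en/bpb (lower is better).
# Both are synthetic here; in practice they come from actual training runs.
X = rng.dirichlet(np.ones(len(domains)), size=512)
y = 1.0 - 0.3 * X[:, 1] - 0.2 * X[:, 2] + 0.05 * rng.normal(size=512)

# Fit the regression model linking mix ratios to the metric.
model = lgb.LGBMRegressor(n_estimators=200, learning_rate=0.05)
model.fit(X, y)

# Search: sample a large number of candidate mixtures and keep the one
# with the best (lowest) predicted metric.
candidates = rng.dirichlet(np.ones(len(domains)), size=100_000)
best = candidates[model.predict(candidates).argmin()]
print(dict(zip(domains, best.round(3))))
```

Because the fitted surrogate is cheap to query, the mixture simplex can be searched far more densely than actual training runs would ever allow.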

Long‑context performance is identified as an architectural issue rather than a data problem. Experiments reveal that a specific normalization layer (QK‑Norm) limits the model’s ability to handle extended contexts; five architecture variants trained on the same 2‑trillion‑token corpus show dramatically different results on long‑context tasks.
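For reference, QK-Norm RMS-normalizes queries and keys per head before the dot product. A minimal PyTorch sketch (illustrative dimensions; nn.RMSNorm requires PyTorch 2.4 or later):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)
        # QK-Norm: normalize queries and keys over the head dimension.
        self.q_norm = nn.RMSNorm(self.head_dim)
        self.k_norm = nn.RMSNorm(self.head_dim)

    def forward(self, x):
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (b, t, self.n_heads, self.head_dim)
        q = self.q_norm(q.view(shape)).transpose(1, 2)
        k = self.k_norm(k.view(shape)).transpose(1, 2)
        v = v.view(shape).transpose(1, 2)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(b, t, -1))
```

Normalizing q and k bounds the attention logits, which stabilizes training but, per the experiments above, can interact badly with length extrapolation.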

Post‑training techniques include reasoning‑data distillation, where a strong teacher model (e.g., DeepSeek) generates high‑quality data for supervised fine‑tuning (SFT) of a smaller model, followed by preference tuning (DPO) that yields several percentage points of gain on evaluation sets.
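A minimal sketch of the distillation step, assuming a HuggingFace-style teacher (the model name and sampling settings below are placeholders, not the exact setup from the talk):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"  # illustrative choice
tok = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name, device_map="auto")

def distill(prompts, max_new_tokens=2048):
    """Generate teacher reasoning traces to use as SFT targets for a student."""
    records = []
    for p in prompts:
        inputs = tok(p, return_tensors="pt").to(teacher.device)
        out = teacher.generate(**inputs, max_new_tokens=max_new_tokens,
                               do_sample=True, temperature=0.7)
        # Keep only the newly generated tokens as the target response.
        completion = tok.decode(out[0, inputs["input_ids"].shape[1]:],
                                skip_special_tokens=True)
        records.append({"prompt": p, "response": completion})
    return records
```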

The traditional “model‑pool” method for preference data collection has saturated because modern models produce highly similar responses. A newer “delta learning” hypothesis focuses on the relative difference between chosen and rejected answers, aligning well with contrastive loss functions used in DPO.
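The DPO objective makes the connection explicit: only the gap between chosen and rejected log-ratios enters the loss, so what the model learns is exactly the delta. A minimal sketch of the standard DPO loss (Rafailov et al., 2023; variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratios of the trained policy against a frozen reference model.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Only the *difference* between chosen and rejected matters; absolute
    # response quality cancels out, which is the delta-learning view.
    delta = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(delta).mean()
```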

Practical RLVR (reinforcement learning with verifiable rewards) engineering challenges are addressed: synchronous pipelines waste GPU cycles, so asynchronous and off-policy algorithms are adopted despite stability trade-offs. Solutions include kernel modifications for batch-invariant behavior, re-computing log-probabilities to ensure numerical consistency, importance sampling with advantage re-weighting, and dynamic weight updates that periodically push the latest trainer weights to the generator to reduce policy lag.
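As one concrete piece, here is a sketch of truncated importance sampling with advantage re-weighting for stale off-policy rollouts; the truncation threshold and the token-level formulation are illustrative choices, not the project's exact recipe:

```python
import torch

def off_policy_pg_loss(new_logps, behavior_logps, advantages, max_ratio=2.0):
    """REINFORCE-style loss with a truncated importance-sampling correction.

    new_logps:      log pi_theta(a_t | s_t), re-computed by the trainer
    behavior_logps: log-probs recorded by the (lagged) generator
    advantages:     per-token advantage estimates
    """
    # Ratio of the current policy to the behavior policy that produced the data.
    ratio = torch.exp(new_logps - behavior_logps)
    # Truncate to bound the variance introduced by policy lag.
    ratio = torch.clamp(ratio, max=max_ratio).detach()
    return -(ratio * advantages * new_logps).mean()
```

Re-computing new_logps on the trainer side also catches the numerical drift between generator and trainer kernels that the batch-invariant kernel work is meant to eliminate.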

(Figures in the original post illustrate the training stack, data-mix experiments, architecture comparisons, and RL system diagrams.)

Translated notes from the original Chinese block:

0. More data generally helps; OT4 simply scales up the data without changing methodology and gains a few points.
1. Why not use the strongest model as teacher? A slightly weaker teacher (e.g., Qwen-32B) sometimes generates better data than a stronger one; the best teacher varies per task.
2. Why extend the context window? The authors claim longer context has little impact: 60% of OT3's data was truncated at 16k, yet the model still learned reasoning patterns and sometimes even benefited from the truncation.
3. OT focuses on math, code, and STEM; high-quality chat data (the dcft series) labeled by llama-3.1-nemotron-70b is also added.
4. Data deduplication similar to that in the OT paper is required.
5. SFT on small models is efficient; see the R1 paper for details.
Tags: open-source models, RLVR, data selection, post-training, training architecture
Written by Baobao Algorithm Notes, author of the BaiMian large model, offering technology and industry insights.