How to Build and Improve Reasoning LLMs: Methods, Trade‑offs, and DeepSeek Insights
This article explains what reasoning language models are and when they are needed, then reviews four main techniques (inference‑time scaling, pure reinforcement learning, combined SFT + RL, and distillation), illustrated with DeepSeek‑R1’s development, cost analysis, and low‑budget alternatives.
What Is a Reasoning Model?
A reasoning model is a large language model (LLM) that is explicitly trained or prompted to solve problems that require a sequence of intermediate steps, such as puzzles, advanced mathematics, or programming challenges. Unlike standard LLMs that often return a single short answer, reasoning models generate a step‑by‑step trace that reveals the underlying thought process.
When to Use Reasoning Models
Reasoning models are most beneficial for tasks that cannot be answered with a single fact lookup—e.g., multi‑step calculations, algorithm design, or logical deduction. For simple summarization, translation, or factual QA, a standard LLM is cheaper and faster; using a reasoning model for those tasks adds latency and cost without improving quality.
DeepSeek‑R1 Training Overview
DeepSeek released three checkpoints that illustrate different training pipelines:
DeepSeek‑R1‑Zero : 671 B DeepSeek‑V3 base model trained **only** with reinforcement learning (RL). No supervised fine‑tuning (SFT) is performed, making it a “cold‑start” RL experiment.
DeepSeek‑R1 : Builds on R1‑Zero by adding SFT stages (the first on data generated by the Zero model) and further RL stages, which introduce a consistency reward to reduce language‑mixing.
DeepSeek‑R1‑Distill : Uses the SFT data produced for R1 to fine‑tune smaller Llama (8 B and 70 B) and Qwen (1.5 B‑32 B) models, creating lightweight reasoning models.
Four Main Methods to Build and Improve Reasoning Models
1. Inference‑time Scaling
At inference, additional compute can be spent to improve answer quality. Typical techniques include:
Chain‑of‑Thought (CoT) prompting: prepend instructions such as “Think step‑by‑step” to force the model to emit intermediate reasoning steps.
Majority voting: generate multiple completions and select the most common answer.
Beam search or other search strategies that explore a larger token space.
These methods increase latency and token usage but require no extra training.
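Two of the techniques above, CoT prompting and majority voting, can be combined as in the minimal sketch below. The `generate` callable is a hypothetical stand-in for whatever model API is in use, and the convention that the final answer sits on the last line of each completion is an assumption made for illustration.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most common answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one answer")
    normalized = [a.strip().lower() for a in answers]
    return Counter(normalized).most_common(1)[0][0]


def self_consistency(generate: Callable[[str], str], prompt: str, n: int = 5) -> str:
    """Sample n chain-of-thought completions and vote on their final lines.

    `generate` is a hypothetical function: prompt in, completion text out.
    """
    cot_prompt = prompt + "\nThink step-by-step, then give the final answer on the last line."
    finals = []
    for _ in range(n):
        lines = generate(cot_prompt).strip().splitlines()
        # Assumed convention: the last line of the completion is the answer.
        finals.append(lines[-1] if lines else "")
    return majority_vote(finals)
```

Each extra sample adds a full generation pass, which is where the added latency and token usage come from.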
2. Pure Reinforcement Learning (RL)
DeepSeek‑R1‑Zero demonstrates that a model can acquire reasoning behavior by training solely with RL. Two reward signals are used:
Accuracy reward : For programming tasks, solutions are checked with the LeetCode compiler; for math, a deterministic evaluator verifies the result.
Format reward : A separate LLM reviewer checks that the output follows a prescribed “step‑by‑step” template.
Because no SFT precedes RL, the pipeline is simpler but typically yields lower performance than RL + SFT.
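A minimal sketch of how the two reward signals could be combined is shown below. Several details are assumptions made for illustration: the `<think>...</think>` layout stands in for the prescribed template, a regular expression replaces the LLM reviewer, exact string match replaces the compiler and math evaluators, and the `format_weight` value is arbitrary.

```python
import re

# Assumed template: reasoning inside <think> tags, then the final answer.
THINK_PATTERN = re.compile(r"^<think>(.+?)</think>\s*(.+)$", re.DOTALL)


def format_reward(completion: str) -> float:
    """1.0 if the completion follows the assumed template, else 0.0."""
    return 1.0 if THINK_PATTERN.match(completion.strip()) else 0.0


def accuracy_reward(completion: str, expected: str) -> float:
    """1.0 if the text after </think> equals the reference answer.

    Exact match is a stand-in for a compiler run or a math evaluator.
    """
    match = THINK_PATTERN.match(completion.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(2).strip() == expected.strip() else 0.0


def total_reward(completion: str, expected: str, format_weight: float = 0.5) -> float:
    """Combine both signals; the weighting is illustrative only."""
    return accuracy_reward(completion, expected) + format_weight * format_reward(completion)
```

Both signals here can be checked automatically, so no human-labeled preference data is needed to compute them.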
3. Supervised Fine‑Tuning + RL (SFT + RL)
DeepSeek‑R1 starts with a “cold‑start” SFT stage on data generated by the Zero model. An RL stage then re‑applies the accuracy and format rewards and adds a **consistency reward** that penalizes language‑mixing (e.g., switching between English and Chinese within a single answer). The resulting checkpoint is used to generate ~600 K CoT examples, which are combined with ~200 K knowledge‑based examples for another round of instruction tuning, followed by a final RL stage. This pipeline yields a substantial jump in benchmark scores over pure RL.
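The consistency reward can be pictured as the share of an answer written in the target language. The sketch below is a crude proxy under stated assumptions: it counts Latin‑alphabet words and individual CJK characters as units, and both the unit choice and the function name are hypothetical rather than taken from the DeepSeek implementation.

```python
import re

CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")   # common CJK ideographs
LATIN_WORD = re.compile(r"[A-Za-z]+")       # runs of Latin letters


def language_consistency_reward(text: str, target: str = "en") -> float:
    """Fraction of counted units that belong to the target language.

    Assumption: only English ("en") and Chinese (any other value) are
    distinguished; text with neither script gets 0.0.
    """
    latin = len(LATIN_WORD.findall(text))
    cjk = len(CJK_CHAR.findall(text))
    total = latin + cjk
    if total == 0:
        return 0.0
    share_en = latin / total
    return share_en if target == "en" else 1.0 - share_en
```

An answer written entirely in the target language scores 1.0, and every unit from the other language lowers the score.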
4. Pure SFT and Distillation
Distillation creates smaller, cheaper models by fine‑tuning them on the same SFT data generated by the large checkpoints. The resulting models (e.g., Llama‑8B, Qwen‑32B) are far more efficient and still outperform the pure‑RL baseline, though they do not reach the performance of the RL + SFT flagship.
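In this sense, distillation is a data‑preparation step followed by ordinary SFT. A minimal sketch of the data side follows; the field names (`question`, `teacher_reasoning`, `teacher_answer`, `reference`) are hypothetical, and dropping traces whose answer disagrees with a reference is an assumed filtering rule, not a documented part of the DeepSeek pipeline.

```python
import json
from typing import Dict, Iterable, List


def build_sft_records(samples: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Turn teacher traces into prompt/completion pairs for student SFT."""
    records = []
    for s in samples:
        # Assumed filter: keep only traces that reach the reference answer.
        if s["teacher_answer"].strip() != s["reference"].strip():
            continue
        completion = (
            f"{s['teacher_reasoning'].strip()}\n\n"
            f"Final answer: {s['teacher_answer'].strip()}"
        )
        records.append({"prompt": s["question"].strip(), "completion": completion})
    return records


def to_jsonl(records: List[Dict[str, str]]) -> str:
    """Serialize records as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

The student model is then fine‑tuned on these records like any other instruction dataset.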
Advantages and Limitations
Inference‑time scaling : No extra training, but inference cost grows linearly with the number of generated tokens and search passes.
Pure RL : Provides research insight into emergent reasoning, yet usually underperforms RL + SFT on standard benchmarks.
RL + SFT : Currently the most effective recipe for high‑performance reasoning models.
Distillation : Produces cheap, deployable models but cannot push the frontier of capability because it relies on a stronger teacher.
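The linear cost growth noted for inference‑time scaling can be made concrete with a small back‑of‑the‑envelope helper. All parameter names and any price passed in are hypothetical, and the sketch assumes the prompt is re‑processed for every sample (no prefix caching).

```python
def inference_cost(
    prompt_tokens: int,
    tokens_per_sample: int,
    n_samples: int,
    price_per_1k_tokens: float,
) -> float:
    """Estimated cost of answering one query with n sampled completions.

    Assumption: every sample pays for the prompt and its own output tokens
    at one flat per-token price.
    """
    total_tokens = n_samples * (prompt_tokens + tokens_per_sample)
    return total_tokens / 1000 * price_per_1k_tokens
```

Under these assumptions, voting over eight samples costs eight times as much as a single completion of the same length.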
Cost‑Effective Strategies
Training the full DeepSeek‑R1 pipeline likely costs several million dollars. However, two low‑budget projects demonstrate that reasoning ability can be obtained for a fraction of that cost:
TinyZero (3 B parameters) reproduces the RL‑only pipeline of R1‑Zero for under $30.
Sky‑T1 (32 B parameters) was trained with only 17 K SFT examples, costing roughly $450.
These examples show that pure RL on a small model, or SFT on a small, carefully curated dataset, can produce reasoning behavior at a small fraction of the compute expense.
Emerging Idea: Journey Learning
Journey Learning extends standard SFT by deliberately inserting **incorrect** solution paths into the training set. The model learns to recognize and correct its own mistakes, similar to the self‑verification observed in TinyZero. This approach may become a practical low‑budget alternative when pure RL is computationally prohibitive.
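One way to picture such a training example is a completion that contains a wrong step, an explicit correction, and then the right path. The sketch below is illustrative only: the function name, the "Wait," marker, and the line‑per‑step layout are assumptions, not the format used in the Journey Learning report.

```python
from typing import Dict, List


def build_journey_example(
    question: str,
    wrong_steps: List[str],
    critique: str,
    correct_steps: List[str],
    answer: str,
) -> Dict[str, str]:
    """Build one SFT record whose completion includes a mistake and its fix."""
    parts = [
        *wrong_steps,                 # deliberately flawed reasoning
        f"Wait, {critique}",          # assumed marker for self-correction
        *correct_steps,               # corrected reasoning
        f"Final answer: {answer}",
    ]
    return {"prompt": question, "completion": "\n".join(parts)}


example = build_journey_example(
    question="What is 3 * 4 + 2?",
    wrong_steps=["3 * 4 = 7"],
    critique="3 * 4 is 12, not 7.",
    correct_steps=["3 * 4 = 12", "12 + 2 = 14"],
    answer="14",
)
```

A standard SFT example would contain only the correct steps and the final answer; the extra lines are what expose the model to error recovery.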
Summary of the Four Strategies
Inference‑time scaling : Improves performance without model changes; incurs higher latency and token cost.
Pure RL : Shows that reasoning can emerge from reward‑driven training alone; best for research prototypes.
RL + SFT : Combines high‑quality SFT data with RL fine‑tuning; the most reliable path to state‑of‑the‑art reasoning models (e.g., DeepSeek‑R1).
Distillation / pure SFT : Generates smaller, cheaper models by re‑using SFT data; useful when deployment resources are limited.
Practical Recommendations for Limited Budgets
If compute resources are constrained, consider the following workflow:
Collect a modest SFT dataset (e.g., 10 K–20 K high‑quality CoT examples).
Fine‑tune a base model (e.g., Llama‑7B or Qwen‑2.5‑7B) on this data.
If additional performance is needed, apply a lightweight RL stage using simple accuracy rewards (e.g., math evaluator) and a format reward.
Optionally distill the resulting model to an even smaller checkpoint for deployment.
This pipeline mirrors the TinyZero and Sky‑T1 experiments, both of which reported training costs under $500; the actual cost depends on model size, dataset size, and hardware.
References
[1] DeepSeek‑R1 technical report: https://arxiv.org/abs/2501.12948
[2] 2024 AI research paper roundup (part 2): https://magazine.sebastianraschka.com/p/ai-research-papers-2024-part-2
[3] Scaling LLM Test‑Time Compute Optimally Can be More Effective than Scaling Model Parameters: https://arxiv.org/abs/2408.03314
[4] LLM Training: RLHF and Alternatives: https://magazine.sebastianraschka.com/p/llm-training-rlhf-and-its-alternatives
[5] TinyZero repository (3 B RL‑only model): https://github.com/Jiayi-Pan/TinyZero/
[6] O1 Replication Journey: A Strategic Progress Report – Part 1 (introduces Journey Learning): https://arxiv.org/abs/2410.18982
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
