Why Post‑Training Matters: Scaling Laws, Fine‑Tuning, and RL Strategies for LLMs
This article explores the importance of post‑training for large language models, explains scaling laws for pre‑ and post‑training, details common fine‑tuning methods (full, PEFT, LoRA), outlines alignment techniques such as RLHF, DPO, PPO, and presents practical workflows using Llama 3 and DeepSeek‑R1, while also discussing test‑time reasoning optimizations.
What Is Post‑Training?
Post‑training is any further training applied to a pretrained model, typically fine‑tuning and alignment, that adjusts the model's parameters for a specific task, dataset, or set of preferences.
Figure legend (image not reproduced): black marks the pre‑training stage, red the post‑training stage, and purple the inference/testing stage.
Why Perform Post‑Training?
Post‑training matters because scaling laws are now emerging for the post‑training stage as well.
Pre‑training scaling law
When not bottlenecked by the other two, compute C, model parameters N, and dataset size D each have a power‑law relationship with test loss L: L(N) ∝ N^(−α_N), L(D) ∝ D^(−α_D), L(C) ∝ C^(−α_C) (the original formula image is not reproduced here).
As model size grows, marginal gains from pre‑training diminish.
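To make the diminishing returns concrete, the sketch below evaluates the parameter‑count power law L(N) = (N_c / N)^α_N. The constants α_N ≈ 0.076 and N_c ≈ 8.8e13 are the approximate fits reported in the Scaling Laws paper listed in the references; they depend on the tokenizer and dataset, so treat them as illustrative assumptions rather than universal values.

```python
# Approximate fitted constants from Kaplan et al. (2020); illustrative only.
ALPHA_N = 0.076   # power-law exponent for (non-embedding) parameter count
N_C = 8.8e13      # scale constant, in parameters


def loss_from_params(n_params: float) -> float:
    """Predicted loss L(N) = (N_c / N) ** alpha_N when data and compute are not the bottleneck."""
    return (N_C / n_params) ** ALPHA_N


if __name__ == "__main__":
    previous = None
    for n in (1e8, 1e9, 1e10, 1e11):
        loss = loss_from_params(n)
        note = "" if previous is None else f"  (absolute drop vs. 10x smaller model: {previous - loss:.3f})"
        print(f"N = {n:.0e}  L = {loss:.3f}{note}")
        previous = loss
```

Under this law every 10x increase in parameters multiplies the loss by the same factor, 10^(−0.076) ≈ 0.84, so the absolute improvement bought by each additional 10x keeps shrinking.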
RL‑based post‑training is expected to be the next breakthrough.
Autoregressive models struggle with mathematical reasoning because they generate tokens left to right and cannot go back to correct an earlier mistake; scaling parameters alone yields limited benefits, which has prompted the search for new scaling laws.
GPT series are typical autoregressive language models. During text generation, the model predicts the next token probability distribution based on the given context (e.g., "The cat").
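As a small illustration of next‑token prediction, the sketch below turns a set of made‑up logits for the context "The cat" into a probability distribution and picks the most likely token. The vocabulary and logit values are invented for the example; a real model produces logits over its full vocabulary.

```python
import math


def softmax(logits: dict[str, float]) -> dict[str, float]:
    """Convert raw scores into a probability distribution over tokens."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


# Hypothetical logits a model might assign after the context "The cat".
toy_logits = {"sat": 2.0, "ran": 1.0, "is": 0.5, "banana": -3.0}
probs = softmax(toy_logits)

# Greedy decoding: append the most likely token, then repeat with the longer context.
next_token = max(probs, key=probs.get)
```

Once a token is appended it becomes part of the context for every later step, which is why an early error cannot be revised.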
Common Large‑Model Post‑Training Workflow (Example: Llama 3)
1. Continuously generate preference‑pair samples via human annotation or synthetic methods.
2. Train a Reward Model (RM) on these pairs.
3. Sample K (10–30) responses from the current best model for each prompt, forming <Prompt,Response_k> pairs.
4. Rank the K responses with the RM and select the top‑N as high‑quality SFT data.
5. Fine‑tune an SFT model on the selected data.
6. Align the SFT model using the collected preference data (e.g., DPO).
7. Repeat steps 1–6 to keep improving the model.
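The sampling and ranking steps of this workflow (sample K responses, score them with the RM, keep the top N) can be sketched as follows. This is a minimal sketch, not the Llama 3 implementation: `build_sft_data`, `toy_sample`, and `toy_reward` are hypothetical names, and the toy functions stand in for a real model and reward model.

```python
import random
from typing import Callable


def build_sft_data(
    prompts: list[str],
    sample: Callable[[str], str],
    reward: Callable[[str, str], float],
    k: int = 10,
    top_n: int = 1,
) -> list[tuple[str, str]]:
    """Sample k responses per prompt, rank them with the reward model, keep the top_n."""
    selected = []
    for prompt in prompts:
        candidates = [sample(prompt) for _ in range(k)]
        ranked = sorted(candidates, key=lambda r: reward(prompt, r), reverse=True)
        selected.extend((prompt, response) for response in ranked[:top_n])
    return selected


# Toy stand-ins: NOT a real language model or reward model.
rng = random.Random(0)


def toy_sample(prompt: str) -> str:
    return f"{prompt} -> answer {rng.randint(0, 99)}"


def toy_reward(prompt: str, response: str) -> float:
    return float(response.rsplit(" ", 1)[-1])  # pretend larger numbers are better


sft_data = build_sft_data(["2+2?", "capital of France?"], toy_sample, toy_reward, k=10, top_n=2)
```

The returned (prompt, response) pairs are what the SFT step would train on; the preference pairs from the annotation step feed the later alignment step instead.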
Training Data
SFT data
Sampling details:
Sample from the best‑performing model checkpoint, or from the model that is strongest in a particular capability.
Sample 10–30 times per prompt.
Prompts are manually labeled; special system prompts are introduced in later iterations.
Preference data
Four preference levels: significantly better, better, slightly better, marginally better.
Annotators may edit the chosen response; final order: edited > chosen > rejected.
Difficulty of prompts increases as the model improves.
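One way to turn the ordering edited > chosen > rejected into training pairs is to emit every (preferred, dispreferred) combination that respects it. The sketch below does that; the `Annotation` record and the choice to expand into all pairs are assumptions made for illustration, not details confirmed by the source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Annotation:
    prompt: str
    chosen: str
    rejected: str
    edited: Optional[str] = None  # annotator's improved version of `chosen`, if any


def to_preference_pairs(a: Annotation) -> list[tuple[str, str, str]]:
    """Expand one annotation into (prompt, preferred, dispreferred) triples
    that respect the ordering edited > chosen > rejected."""
    ranking = [r for r in (a.edited, a.chosen, a.rejected) if r is not None]
    return [
        (a.prompt, ranking[i], ranking[j])
        for i in range(len(ranking))
        for j in range(i + 1, len(ranking))
    ]
```

An annotation with an edited response yields three pairs (edited over chosen, edited over rejected, chosen over rejected); without an edit it yields the single chosen over rejected pair.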
Fine‑Tuning
Fine‑tuning adapts a pretrained model to a specific task or domain by further training on task‑specific data.
SFT (Supervised Fine‑Tuning) methods
Full‑parameter fine‑tuning (FFT): updates all model parameters.
Parameter‑efficient fine‑tuning (PEFT): updates only a small subset of parameters.
LoRA (Low‑Rank Adaptation): freezes the pretrained weight matrix W and learns a low‑rank update ΔW = BA, training only the small matrices A and B.
Other PEFT methods add trainable tokens or prompt embeddings (e.g., prompt tuning).
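A minimal sketch of the LoRA idea in plain Python: the pretrained weight W stays frozen, and the effective weight is W + (alpha / r) · BA, where only A and B would be trained. The tiny dimensions, the `alpha` value, and the helper names are assumptions for illustration; the zero initialization of B and Gaussian initialization of A follow the original LoRA paper.

```python
import random


def matmul(x, y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_effective_weight(w, b, a, alpha: float):
    """Return W + (alpha / r) * B @ A, where r is the rank (number of rows of A)."""
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


d_out, d_in, rank = 8, 8, 2
rng = random.Random(0)
w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]    # frozen pretrained weight
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]  # trainable, rank x d_in
b = [[0.0] * rank for _ in range(d_out)]                              # trainable, d_out x rank, zero init

full_params = d_out * d_in            # parameters updated by full fine-tuning
lora_params = rank * (d_out + d_in)   # parameters updated by LoRA
w_eff = lora_effective_weight(w, b, a, alpha=4.0)
```

Because B starts at zero, the adapted model is identical to the pretrained one before training begins. The parameter saving is modest at this toy size (32 vs. 64) but grows quickly with layer width, since LoRA scales with r · (d_out + d_in) rather than d_out · d_in.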
Alignment
Alignment adjusts model outputs to match human preferences and values, preventing harmful or unethical content.
Reinforcement Learning from Human Feedback (RLHF) is the core tool, using human‑labeled preference data to train a reward model and optimize the policy.
Human preference labels (<input, accept, reject>).
Reward Model (RM) → reward signal.
Policy optimization algorithms: DPO (Direct Preference Optimization), PPO (Proximal Policy Optimization), GRPO (Group Relative Policy Optimization).
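DPO optimizes the policy directly on preference pairs, raising the likelihood of the preferred response relative to the rejected one (measured against a frozen reference model), without training a separate reward model.
PPO builds a loss from reward signals and clips the policy update ratio to keep training stable.
GRPO drops PPO's learned value model and instead estimates advantages by comparing the rewards of a group of responses sampled for the same prompt.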
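The two most widely used objectives in this list can be written down in a few lines. The sketch below computes the DPO loss for a single preference pair and the PPO clipped surrogate for a single action, following the formulas in the respective papers. The inputs are plain log‑probabilities that a real training loop would obtain from the policy and reference models; `beta=0.1` and `eps=0.2` are common defaults, assumed here.

```python
import math


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one preference pair, given sequence log-probabilities under
    the policy and under the frozen reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


def ppo_clipped_objective(
    logp_new: float, logp_old: float, advantage: float, eps: float = 0.2
) -> float:
    """PPO clipped surrogate for one action (a quantity to maximize)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)
```

When the policy equals the reference model the DPO margin is zero and the loss is log 2; raising the chosen response's log‑probability lowers it. In the PPO objective, a ratio beyond 1 + eps earns no extra credit for a positive advantage, which is what limits the size of each policy update.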
RL Reward Model Optimization
Traditional RL maximizes cumulative reward but suffers from reward design difficulty. RLHF introduces human feedback (rewards, rankings, preferences) as the reward signal.
LLM as a judge for factuality and style.
Generative RM: chain‑of‑thought reasoning followed by reward.
Critic Model (e.g., CriticGPT) evaluates hidden errors in model outputs.
Outcome‑based RM (ORM) vs. Process‑based RM (PRM) – ORM scores final output, PRM scores each reasoning step.
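The difference between the two reward‑model styles is easiest to see on a worked solution with one faulty step. In the sketch below the scores are hand‑assigned stand‑ins for learned reward models, and all names are hypothetical.

```python
from typing import Callable


def orm_score(final_answer: str, score_answer: Callable[[str], float]) -> float:
    """Outcome reward: a single score for the final output only."""
    return score_answer(final_answer)


def prm_scores(steps: list[str], score_step: Callable[[str], float]) -> list[float]:
    """Process reward: one score per reasoning step."""
    return [score_step(step) for step in steps]


# Hand-assigned toy scores standing in for learned reward models.
TOY_SCORES = {"2 + 3 = 5": 1.0, "5 * 4 = 21": 0.0, "answer: 21": 0.6}
steps = list(TOY_SCORES)

outcome = orm_score(steps[-1], TOY_SCORES.__getitem__)   # one number for the whole solution
per_step = prm_scores(steps, TOY_SCORES.__getitem__)     # one number per step
first_bad_step = per_step.index(min(per_step))           # the PRM localizes the error
```

The ORM only reports that the final answer looks questionable, while the PRM points at the second step as the source of the error. How step scores are collapsed into one number (minimum, product, last step) is a design choice that varies between papers.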
Test‑Time (Inference) Optimization
Fast (System 1) vs. slow (System 2) reasoning:
Next‑Token Prediction lacks intermediate reasoning, leading to error propagation.
Chain‑of‑Thought (CoT) prompts the model to generate step‑by‑step reasoning.
"Let's think step by step" prompting.
Best‑of‑N sampling.
Monte Carlo Tree Search (MCTS) treats tokens or sentences as nodes and uses process‑based rewards to guide generation.
STaR (Self‑Taught Reasoner) iteratively teaches the model to produce rationales and incorporates them into training.
Quiet‑STaR trains the model to generate an internal rationale before predicting each upcoming token, improving reasoning performance.
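Two of the simpler techniques above, zero‑shot CoT prompting and Best‑of‑N sampling, combine naturally: append the step‑by‑step cue, sample N responses, and keep the one a verifier scores highest. This is a minimal sketch in which `generate` and `verifier` are hypothetical callables standing in for a model and a reward model or answer checker.

```python
from typing import Callable

COT_SUFFIX = "Let's think step by step."


def best_of_n(
    question: str,
    generate: Callable[[str], str],
    verifier: Callable[[str], float],
    n: int = 8,
) -> str:
    """Sample n chain-of-thought responses and return the one the verifier scores highest."""
    prompt = f"{question}\n{COT_SUFFIX}"
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=verifier)


# Toy stand-ins: a canned "model" and a verifier that prefers answers close to 4.
canned = iter(["3", "5", "4", "22"])
best = best_of_n(
    "What is 2 + 2?",
    generate=lambda prompt: next(canned),
    verifier=lambda response: -abs(int(response) - 4),
    n=4,
)
```

Best‑of‑N spends extra inference compute without changing the model's weights, and its quality is bounded by how well the verifier separates good responses from bad ones.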
Comparison: SFT vs. RL
SFT tends to memorize training data and struggles to generalize out of distribution, while RL (especially with outcome‑based rewards) generalizes better across variations in rules and visual inputs.
RL still relies on SFT to stabilize output format before RL fine‑tuning.
DeepSeek‑R1 Post‑Training Example
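DeepSeek‑R1‑Zero applies RL directly to the base model with no SFT stage, using rule‑based rewards (accuracy and format) instead of a learned reward model, and the GRPO optimizer. DeepSeek‑R1 adds a small cold‑start SFT stage before RL. The authors report that process reward models (PRM) and MCTS did not work well in their experiments.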
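A rule‑based reward combined with GRPO's group‑relative advantage can be sketched as follows. The `<think>` tag convention follows the DeepSeek‑R1 paper, but the regular expression, the reward weights, exact‑match answer checking, and the use of the population standard deviation are simplifying assumptions made for this example.

```python
import re
import statistics

# Expect reasoning wrapped in <think>...</think>, followed by the final answer.
THINK_PATTERN = re.compile(r"^<think>.*?</think>\s*(.+)$", re.DOTALL)


def rule_based_reward(response: str, reference_answer: str) -> float:
    """Format reward (reasoning inside <think> tags) plus accuracy reward (exact match)."""
    match = THINK_PATTERN.match(response.strip())
    if match is None:
        return 0.0                       # wrong format: no reward at all
    format_reward = 0.5                  # assumed weight
    answer = match.group(1).strip()
    accuracy_reward = 1.0 if answer == reference_answer else 0.0
    return format_reward + accuracy_reward


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: standardize each reward within its sampling group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]    # identical rewards carry no learning signal
    return [(r - mean) / std for r in rewards]


# One prompt ("What is 2 + 2?"), three sampled responses.
group = ["<think>2 + 2 = 4</think>4", "<think>2 + 2 = 5</think>5", "4"]
rewards = [rule_based_reward(r, "4") for r in group]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the other samples for the same prompt, no separate value model is needed, and a rule‑based checker avoids the reward‑hacking risk that comes with a learned reward model.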
Summary
RL → DeepSeek‑R1‑Zero.
SFT + RL → DeepSeek‑R1, Llama 3.
SFT → distilled small models.
Test‑time scaling → OpenAI o1.
References
Scaling Laws for Neural Language Models: https://arxiv.org/abs/2001.08361
Reasoning LLMs: https://magazine.sebastianraschka.com/p/understanding-reasoning-llms
NeurIPS tutorial on post‑training: https://www.interconnects.ai/p/the-state-of-post-training-2025
Llama 3.1, DeepSeek‑V3, Tülu 3, Qwen 2.5 post‑training collection: https://zhuanlan.zhihu.com/p/12862210431
The Llama 3 Herd of Models: https://arxiv.org/pdf/2407.21783
Scaling LLM Test‑Time Compute Optimally: https://arxiv.org/pdf/2408.03314
Qwen 2 Technical Report: https://arxiv.org/pdf/2407.10671
DeepSeek‑R1 paper (covers DeepSeek‑R1‑Zero): https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf
OpenAI o1 analysis: https://zhuanlan.zhihu.com/p/721952915
OpenAI o1 official page: https://openai.com/index/learning-to-reason-with-llms/
CriticGPT: https://arxiv.org/pdf/2407.00215
Fine‑tuning tutorial: https://study.antgroup-inc.cn/learn/course/842000013/content/990000093/990000095?tenant=metastudy
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
