Why Post‑Training Matters: Scaling Laws, Fine‑Tuning, and RL Strategies for LLMs

This article explores the importance of post‑training for large language models, explains scaling laws for pre‑ and post‑training, details common fine‑tuning methods (full, PEFT, LoRA), outlines alignment techniques such as RLHF, DPO, PPO, and presents practical workflows using Llama 3 and DeepSeek‑R1, while also discussing test‑time reasoning optimizations.

Alibaba Cloud Developer

What Is Post‑Training?

Post‑Training refers to additional training performed on a pretrained model for a specific task or dataset, typically involving fine‑tuning and alignment to adjust the model’s parameters for the new task.

Figure legend: black marks the pre‑training stage, red the post‑training stage, and purple the inference (test‑time) stage.

Why Perform Post‑Training?

Why post‑training matters: scaling laws for post‑training have emerged.

Pre‑training scaling law

Test loss L has a power‑law relationship with each of compute C, model parameters N, and dataset size D, provided the other two are not the bottleneck: L(N) ∝ N^(−α_N), L(D) ∝ D^(−α_D), L(C) ∝ C^(−α_C), where the exponents are small positive constants fitted empirically.
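A small numerical sketch can make the power‑law form concrete. It assumes the loss follows L(N) = (N_c / N)^α when data and compute are not limiting; the constants `ALPHA_N` and `N_C` are hypothetical values chosen only for illustration, not figures taken from this article.

```python
# Illustrative constants only (assumptions, not values from this article).
ALPHA_N = 0.076   # hypothetical power-law exponent for parameter count
N_C = 8.8e13      # hypothetical scale constant


def loss_from_params(n_params: float, alpha: float = ALPHA_N, n_c: float = N_C) -> float:
    """Power-law loss L(N) = (N_c / N) ** alpha."""
    return (n_c / n_params) ** alpha


def absolute_drop(n_params: float, factor: float = 10.0) -> float:
    """Loss reduction obtained by scaling the model by `factor`."""
    return loss_from_params(n_params) - loss_from_params(n_params * factor)


def relative_drop(n_params: float, factor: float = 10.0) -> float:
    """Fractional loss reduction obtained by scaling the model by `factor`."""
    return absolute_drop(n_params, factor) / loss_from_params(n_params)


for n in (1e8, 1e9, 1e10, 1e11):
    print(f"N={n:.0e}  loss={loss_from_params(n):.3f}  drop from 10x={absolute_drop(n):.3f}")
```

Under a pure power law the relative improvement per 10x of parameters is constant, while the absolute improvement shrinks as the model grows, and each further 10x costs far more compute.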

As model size grows, marginal gains from pre‑training diminish.

RL‑based post‑training is expected to be the next breakthrough.

Autoregressive models struggle with mathematical reasoning because they cannot go back and correct tokens they have already generated, so early mistakes propagate; scaling parameters alone yields limited benefits, prompting the search for new scaling laws.

GPT series are typical autoregressive language models. During text generation, the model predicts the next token probability distribution based on the given context (e.g., "The cat").
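As a rough illustration of one next‑token prediction step, the sketch below turns a vector of logits into a probability distribution with softmax and picks the most likely token. The vocabulary and logit values are made up for the context "The cat"; a real model produces logits over tens of thousands of tokens.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy vocabulary and logits for the context "The cat".
VOCAB = ["sat", "ran", "is", "the"]
LOGITS = [2.0, 1.0, 0.5, -1.0]


def next_token_distribution(logits, vocab):
    """Map each candidate token to its predicted probability."""
    return dict(zip(vocab, softmax(logits)))


def greedy_next_token(logits, vocab):
    """Pick the single most probable token (greedy decoding)."""
    probs = softmax(logits)
    best = max(range(len(vocab)), key=lambda i: probs[i])
    return vocab[best]


dist = next_token_distribution(LOGITS, VOCAB)
print(dist)
print("The cat", greedy_next_token(LOGITS, VOCAB))
```

Generation repeats this step, appending the chosen token to the context each time, which is why a token emitted early cannot be revised later.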

Common Large‑Model Post‑Training Workflow (Example: Llama 3)

1. Continuously generate preference‑pair samples via human annotation or synthetic methods.

2. Train a Reward Model (RM) on these pairs.

3. Sample K (10–30) responses from the current best model for each prompt, forming <Prompt, Response_k> pairs.

4. Rank the K responses with the RM and select the top‑N as high‑quality SFT data.

5. Fine‑tune an SFT model on the selected data.

6. Align the SFT model using the collected preference data (e.g., DPO).

7. Iterate steps 1–6 to continuously improve the model.
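The sampling and ranking steps of this workflow (rejection sampling) can be sketched as below. This is a minimal illustration, not the Llama 3 implementation: `toy_generate` and `toy_reward` are hypothetical stand‑ins for the current best policy and the trained reward model.

```python
import random


def rejection_sample(prompt, generate, reward_model, k=10, top_n=1):
    """Sample k responses, rank them with the reward model, keep the top_n."""
    candidates = [generate(prompt) for _ in range(k)]
    ranked = sorted(candidates, key=lambda r: reward_model(prompt, r), reverse=True)
    return [(prompt, response) for response in ranked[:top_n]]


# Hypothetical stand-ins so the sketch runs without a real model.
_rng = random.Random(0)


def toy_generate(prompt):
    """Pretend policy: returns a response tagged with a random number."""
    return f"answer-{_rng.randint(0, 99)}"


def toy_reward(prompt, response):
    """Pretend reward model: a larger tag counts as a better response."""
    return int(response.split("-")[1])


sft_data = rejection_sample("What is 2 + 2?", toy_generate, toy_reward, k=10, top_n=2)
print(sft_data)
```

The selected (prompt, response) pairs would then feed the SFT step, and the loop repeats with the improved model as the new sampler.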

Training Data

SFT data

Sampling details:

Sample from the best‑scoring model, or from the model that excels in a particular capability.

Sample 10–30 times per prompt.

Prompts are manually labeled; special system prompts are introduced in later iterations.

Preference data

Four preference levels: significantly better, better, slightly better, marginally better.

Annotators may edit the chosen response; final order: edited > chosen > rejected.

Difficulty of prompts increases as the model improves.
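One way to turn the ordering edited > chosen > rejected into training pairs is sketched below. The record layout and the choice to emit every ordered pair are assumptions made for illustration; the source does not say exactly how the pairs are constructed.

```python
def build_preference_pairs(prompt, chosen, rejected, edited=None):
    """Expand one annotation into (preferred, dispreferred) pairs.

    The ranking is edited > chosen > rejected; `edited` is optional because
    annotators do not always rewrite the chosen response.
    """
    ranking = [r for r in (edited, chosen, rejected) if r is not None]
    pairs = []
    for i in range(len(ranking)):
        for j in range(i + 1, len(ranking)):
            pairs.append({
                "prompt": prompt,
                "preferred": ranking[i],
                "dispreferred": ranking[j],
            })
    return pairs


with_edit = build_preference_pairs(
    "Explain LoRA briefly.",
    chosen="draft B",
    rejected="draft A",
    edited="draft B, polished by the annotator",
)
without_edit = build_preference_pairs("Explain LoRA briefly.", "draft B", "draft A")
print(len(with_edit), len(without_edit))
```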

Fine‑Tuning

Fine‑tuning adapts a pretrained model to a specific task or domain by further training on task‑specific data.

SFT (Supervised Fine‑Tuning) methods

Full‑parameter fine‑tuning (FFT): updates all model parameters.

Parameter‑efficient fine‑tuning (PEFT): updates only a small subset of parameters.

LoRA (Low‑Rank Adaptation): freezes the pretrained weight matrix W and learns a low‑rank update ΔW = BA, where A and B are two small matrices of rank r; only A and B are trained.

Other PEFT methods, such as additional trainable tokens (e.g., prompt tuning).
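A minimal LoRA layer can be sketched in pure Python to show why so few parameters are trained. It assumes the common formulation h = Wx + (alpha / r) · B(Ax), with W frozen and B initialized to zero so that training starts from the pretrained behavior. The class name, the `alpha` default, and the initialization scale are illustrative choices.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight                       # frozen, d_out x d_in
        self.scale = alpha / rank
        # A: rank x d_in with small random values; B: d_out x rank, all zeros.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


W = [[float(i == j) for j in range(8)] for i in range(8)]   # 8 x 8 identity as a toy W
layer = LoRALinear(W, rank=2)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(layer.forward(x))
print("trainable:", layer.trainable_params(), "full:", 8 * 8)
```

For a d_out × d_in matrix the update trains r · (d_in + d_out) values instead of d_in · d_out, so the saving grows quickly with layer size.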


Alignment

Alignment adjusts model outputs to match human preferences and values, preventing harmful or unethical content.

Reinforcement Learning from Human Feedback (RLHF) is the core tool, using human‑labeled preference data to train a reward model and optimize the policy.

Human preference labels (<input, accept, reject>).

Reward Model (RM) → reward signal.

Policy optimization algorithms: DPO (Direct Preference Optimization), PPO (Proximal Policy Optimization), GRPO (Group Relative Policy Optimization).

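DPO optimizes the policy directly on preference pairs, raising the likelihood of the preferred response relative to the rejected one (measured against a frozen reference model), without training a separate reward model.

PPO builds a loss from reward signals and clips the policy probability ratio so that each update stays close to the previous policy, which keeps training stable.

GRPO samples a group of responses per prompt and uses each response's reward relative to the rest of the group as its advantage, which removes the need for a separate value (critic) model.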
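The three objectives can be compared side by side in a scalar sketch. Each function below works on single numbers rather than token‑level tensors, and the defaults (`beta=0.1`, `eps=0.2`) and the use of the population standard deviation are illustrative assumptions rather than values from the source.

```python
import math
from statistics import mean, pstdev


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one pair: -log(sigmoid(beta * (chosen margin - rejected margin)))."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return math.log1p(math.exp(-margin))      # equals -log(sigmoid(margin))


def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """PPO surrogate for one sample: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)


def grpo_advantages(rewards, tiny=1e-8):
    """Group-relative advantages: (reward - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + tiny) for r in rewards]


# Policy prefers the chosen answer more than the reference does: loss below log(2).
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
# A ratio of 1.5 with positive advantage is clipped to 1.2.
print(ppo_clipped_objective(1.5, 1.0))
# Four sampled responses for one prompt, rewarded 1 (correct) or 0 (wrong).
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```

The DPO function needs only log‑probabilities from the policy and a reference model, while the PPO and GRPO functions consume rewards, which is the practical difference between the two families.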

RL Reward Model Optimization

Traditional RL maximizes cumulative reward but suffers from reward design difficulty. RLHF introduces human feedback (rewards, rankings, preferences) as the reward signal.

LLM as a judge for factuality and style.

Generative RM: chain‑of‑thought reasoning followed by reward.

Critic Model (e.g., CriticGPT) evaluates hidden errors in model outputs.

Outcome‑based RM (ORM) vs. Process‑based RM (PRM) – ORM scores final output, PRM scores each reasoning step.
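The difference between the two reward types shows up on a solution whose final answer is wrong. In the toy sketch below, the ORM returns a single score for the final answer, while the PRM scores every step and can point at the first faulty one. `toy_step_score` is a hypothetical rule‑based checker standing in for a learned process reward model.

```python
def outcome_reward(final_answer, reference_answer):
    """ORM-style score: one number for the final output only."""
    return 1.0 if final_answer.strip() == reference_answer.strip() else 0.0


def toy_step_score(step):
    """Hypothetical step checker: 1.0 if an 'a + b = c' line is arithmetically correct."""
    left, right = step.split("=")
    a, b = left.split("+")
    return 1.0 if int(a) + int(b) == int(right) else 0.0


def process_rewards(steps):
    """PRM-style scores: one number per reasoning step."""
    return [toy_step_score(s) for s in steps]


def first_error(step_scores):
    """Index of the first step scored as wrong, or None if all steps pass."""
    for i, score in enumerate(step_scores):
        if score < 1.0:
            return i
    return None


# Task: compute 2 + 3 + 4 (reference answer 9).
steps = ["2 + 3 = 5", "5 + 4 = 10"]
orm_score = outcome_reward("10", "9")
prm_scores = process_rewards(steps)
print(orm_score, prm_scores, first_error(prm_scores))
```

The ORM only reports that the answer is wrong, whereas the per‑step scores locate the mistake in the second step, which gives a denser training signal at the cost of step‑level labels.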

Test‑Time (Inference) Optimization

Fast (System 1) vs. slow (System 2) reasoning:

Next‑Token Prediction lacks intermediate reasoning, leading to error propagation.

Chain‑of‑Thought (CoT) prompts the model to generate step‑by‑step reasoning.

"Let's think step by step" prompting.

Best‑of‑N sampling.

Monte Carlo Tree Search (MCTS) treats tokens or sentences as nodes and uses process‑based rewards to guide generation.

STaR (Self‑Taught Reasoner) iteratively teaches the model to produce rationales and incorporates them into training.

Quiet‑STaR trains the model to generate internal rationales at each token position while it generates text, improving reasoning performance.
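The data‑collection half of one STaR iteration can be sketched as a filter: generate a rationale and an answer for each problem, and keep only the rationales that lead to the correct answer as new fine‑tuning data. This simplified sketch omits the fine‑tuning step and STaR's rationalization of failed problems; `toy_generate` and its canned outputs are hypothetical stand‑ins for the model.

```python
def star_collect(problems, generate):
    """Keep (question, rationale, answer) triples whose answer matches the gold label."""
    kept = []
    for question, gold in problems:
        rationale, answer = generate(question)
        if answer == gold:
            kept.append((question, rationale, answer))
    return kept


# Hypothetical canned model outputs: one correct, one incorrect.
CANNED = {
    "3 + 4": ("3 plus 4 is 7", "7"),
    "5 * 6": ("5 times 6 is 35", "35"),
}


def toy_generate(question):
    """Pretend model: returns a (rationale, answer) pair for the question."""
    return CANNED[question]


problems = [("3 + 4", "7"), ("5 * 6", "30")]
new_sft_data = star_collect(problems, toy_generate)
print(new_sft_data)
```

In the full loop the model would be fine‑tuned on the kept triples and the collection step repeated with the improved model.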

Comparison: SFT vs. RL

SFT tends to memorize training data and struggles to generalize out‑of‑distribution, while RL (especially with outcome‑based rewards) generalizes better across rule‑based and visual variations.

RL still relies on SFT to stabilize output format before RL fine‑tuning.

DeepSeek‑R1 Post‑Training Example

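DeepSeek‑R1‑Zero is trained with RL only (no SFT), using rule‑based rewards instead of a learned RM and the GRPO optimizer; DeepSeek‑R1 adds a cold‑start SFT stage before RL. The authors report that test‑time scaling methods such as PRM and MCTS did not prove effective in their experiments.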
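A rule‑based reward of this kind needs no learned model: it can check the output format and compare the extracted answer with a reference. The sketch below assumes responses wrap reasoning in <think> tags and the result in <answer> tags; the regular expressions, the exact‑match comparison, and the equal weighting of the two terms are illustrative assumptions.

```python
import re

# Expected layout: <think>reasoning</think> followed by <answer>result</answer>.
FORMAT_PATTERN = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL)
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_reward(completion):
    """1.0 if the completion follows the think/answer layout, else 0.0."""
    return 1.0 if FORMAT_PATTERN.match(completion.strip()) else 0.0


def accuracy_reward(completion, reference):
    """1.0 if the text inside <answer> exactly matches the reference, else 0.0."""
    match = ANSWER_PATTERN.search(completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference.strip() else 0.0


def rule_based_reward(completion, reference):
    """Sum of accuracy and format rewards (equal weights are an assumption)."""
    return accuracy_reward(completion, reference) + format_reward(completion)


good = "<think>2 + 2 equals 4</think> <answer>4</answer>"
no_think = "<answer>4</answer>"
wrong = "<think>2 + 2 equals 5</think> <answer>5</answer>"
for text in (good, no_think, wrong):
    print(rule_based_reward(text, "4"))
```

Because the reward is computed by fixed rules, it cannot be gamed the way a learned reward model can, but it only applies to tasks with checkable answers such as math or code.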

Summary

RL → DeepSeek‑R1‑Zero.

SFT + RL → DeepSeek‑R1, Llama 3.

SFT → distilled small models.

Test‑time scaling → OpenAI o1.

References

Scaling Laws for Neural Language Models: https://arxiv.org/abs/2001.08361

Reasoning LLMs: https://magazine.sebastianraschka.com/p/understanding-reasoning-llms

NeurIPS tutorial on post‑training: https://www.interconnects.ai/p/the-state-of-post-training-2025

Llama 3.1, DeepSeek‑V3, Tülu 3, Qwen 2.5 post‑training collection: https://zhuanlan.zhihu.com/p/12862210431

The Llama 3 Herd of Models: https://arxiv.org/pdf/2407.21783

Scaling LLM Test‑Time Compute Optimally: https://arxiv.org/pdf/2408.03314

Qwen 2 Technical Report: https://arxiv.org/pdf/2407.10671

DeepSeek‑R1 paper (covers DeepSeek‑R1‑Zero): https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf

OpenAI o1 analysis: https://zhuanlan.zhihu.com/p/721952915

OpenAI o1 official page: https://openai.com/index/learning-to-reason-with-llms/

CriticGPT: https://arxiv.org/pdf/2407.00215

Fine‑tuning tutorial: https://study.antgroup-inc.cn/learn/course/842000013/content/990000093/990000095?tenant=metastudy
