
Inside Llama 3.1, DeepSeek‑V3, TÜLU 3 & Qwen 2.5: A Deep Dive into Post‑Training Techniques

This article compiles and analyzes the post‑training pipelines of Llama 3.1, DeepSeek‑V3, TÜLU 3 and Qwen 2.5, detailing their data compositions, SFT, reward modeling, DPO, GRPO, RLVR methods, hyper‑parameters, and practical tricks for large‑language‑model alignment.

Baobao Algorithm Notes

Jan 8, 2025

1 Llama 3.1

paper: https://ai.meta.com/research/publications/the-llama-3-herd-of-models/

Llama 3 uses an iterative post‑training pipeline of six rounds. Each round consists of Reward Modeling (RM), Rejection Sampling, Supervised Fine‑Tuning (SFT), and Direct Preference Optimization (DPO). Two data streams are maintained:

SFT data: results of the round’s rejection sampling, synthetic data for targeted capabilities, and a small manually annotated subset.

Preference data: newly generated preference pairs each round; the set grows cumulatively.

Model averaging is applied after every RM, SFT, or DPO stage: checkpoints trained with different data mixes or hyper‑parameters are averaged into a single model.
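As a rough illustration of checkpoint averaging, the sketch below merges parameter dictionaries with a weighted mean. The function name `average_checkpoints`, the plain-list parameter format, and the weights are assumptions made for this example; the source does not say how Llama 3 weights its checkpoints.

```python
from typing import Dict, List, Sequence


def average_checkpoints(
    checkpoints: Sequence[Dict[str, List[float]]],
    weights: Sequence[float],
) -> Dict[str, List[float]]:
    """Weighted average of parameter dicts that share the same keys and shapes.

    Hypothetical helper: real checkpoints hold tensors, not Python lists.
    """
    if not checkpoints or len(checkpoints) != len(weights):
        raise ValueError("need at least one checkpoint and one weight per checkpoint")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    merged: Dict[str, List[float]] = {}
    for name, reference in checkpoints[0].items():
        # Average each parameter element-wise across all checkpoints.
        merged[name] = [
            sum(w * ckpt[name][i] for ckpt, w in zip(checkpoints, weights)) / total
            for i in range(len(reference))
        ]
    return merged


# Two toy checkpoints trained with different (assumed) data mixes.
ckpt_a = {"layer.weight": [1.0, 2.0]}
ckpt_b = {"layer.weight": [3.0, 6.0]}
merged = average_checkpoints([ckpt_a, ckpt_b], weights=[1.0, 1.0])
```

With equal weights this reduces to a plain mean; unequal weights let one checkpoint dominate the merge.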

1.1 SFT

Rejection Sampling – multiple model samples are generated for each prompt; the RM selects the highest‑scoring response to augment the SFT corpus. Typical settings:

Responses are sampled from the best‑performing model overall, or from the model that excels on the specific capability being targeted.

Each prompt is sampled 10 to 30 times (K = 10–30).

Prompts are hand‑written initially; later rounds introduce special system prompts.
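The rejection-sampling step above can be sketched as best-of-K selection. In this minimal example, `generate` and `score` are hypothetical stand-ins for the policy model and the reward model; they are not APIs from the Llama 3 report.

```python
import random
from typing import Callable, Tuple


def rejection_sample(
    prompt: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    k: int = 10,
) -> Tuple[str, float]:
    """Draw k candidate responses and keep the one the reward model scores highest."""
    if k < 1:
        raise ValueError("k must be at least 1")
    candidates = [generate(prompt) for _ in range(k)]
    scored = [(score(prompt, c), c) for c in candidates]
    best_score, best_response = max(scored, key=lambda pair: pair[0])
    return best_response, best_score


# Toy stand-ins (assumptions): a random "model" and a length-based "reward".
rng = random.Random(0)


def toy_generate(prompt: str) -> str:
    return prompt + " -> " + "x" * rng.randint(1, 20)


def toy_score(prompt: str, response: str) -> float:
    return float(len(response))


best, best_score = rejection_sample("2 + 2 = ?", toy_generate, toy_score, k=10)
```

The selected response would then be added to the SFT corpus for the next round.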

SFT training details (example: 405B model):

Learning rate: 1e‑5.

Training steps: 8.5K–9K.

High‑quality samples are seen 3–4 epochs each; low‑quality samples are down‑sampled and often used only once.

1.2 Preference data

After each round, multiple models (trained with different data mixes or alignment recipes) are deployed. For every user prompt, two distinct models generate responses, increasing diversity.

Four preference grades: significantly better, better, slightly better, marginally better.

Annotators may edit the chosen response; final ordering is edited > chosen > rejected.

Prompt difficulty is gradually increased as model capabilities improve.

1.3 RM & DPO

Reward Modeling – a new reward model is trained each iteration on the full set of preference data collected so far, always initialized from the pre‑trained checkpoint.

DPO – uses only the most recent batches of preference data collected from the best models of previous rounds. Key settings:

Only pairs graded "significantly better" or "better" are kept; pairs whose chosen and rejected responses are too similar are filtered out.

Special tokens (e.g., header or termination tokens) are masked in the loss.

Chosen‑response SFT loss weight: 0.2.

Learning rate: 1e‑5; beta (KL regularization): 0.1.

Observation: training DPO on short contexts does not degrade long‑context performance when the underlying SFT model already excels on long contexts.

1.4 Data cleaning

Pragmatic cleaning removes noisy reply patterns (excessive emojis, repeated exclamation marks) and stereotypical AI phrasing (over‑apologizing, overly polite language).

Additional pipeline steps:

Topic classification: train a classifier on large text‑classification corpora, then assign coarse and fine topics to every sample.

Quality scoring: use a reward model and Llama‑based prompts to score samples; the top 25% are treated as high‑quality.

Difficulty scoring: employ InsTag or Llama‑based prompts to assign a difficulty level based on intent richness.

Semantic deduplication: cluster dialogues with RoBERTa, rank them within each cluster by (quality × difficulty), then greedily keep samples whose cosine similarity to already‑kept items is below a threshold.
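The greedy deduplication step can be sketched for a single cluster as follows. The embeddings here are toy 2-D vectors rather than RoBERTa outputs, and the 0.9 threshold is an illustrative assumption; the source does not state the value used.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def greedy_dedup(
    embeddings: Sequence[Sequence[float]],
    quality: Sequence[float],
    difficulty: Sequence[float],
    threshold: float = 0.9,
) -> List[int]:
    """Return indices kept after greedy semantic deduplication within one cluster."""
    # Visit samples from highest to lowest (quality x difficulty).
    order = sorted(
        range(len(embeddings)),
        key=lambda i: quality[i] * difficulty[i],
        reverse=True,
    )
    kept: List[int] = []
    for i in order:
        # Keep a sample only if it is dissimilar to everything kept so far.
        if all(cosine(embeddings[i], embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
kept = greedy_dedup(embeddings, quality=[0.9, 0.8, 0.5], difficulty=[1.0, 1.0, 1.0])
```

In this toy cluster the second sample is a near-duplicate of the first and is dropped, while the third points in a different direction and is kept.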

2 DeepSeek‑V3

paper: https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf

DeepSeek‑V3 follows an SFT → GRPO post‑training path, with additional experiments on distillation from DeepSeek‑R1, self‑rewarding, and multi‑token prediction.

2.1 SFT

A 1.5 M instruction‑fine‑tuning dataset is built, split into reasoning (math, coding, logic) and non‑reasoning data.

Reasoning data are generated by DeepSeek‑R1, then filtered to reduce over‑thinking, formatting errors, and verbosity.

Two SFT sample formats are produced for each instance:

<question, original answer>

<system prompt, question, R1 answer>, where the system prompt encourages reflective answering.

During RL, high‑temperature sampling mixes R1‑generated and original data, allowing the expert model to learn a blended style. After RL, rejection sampling selects the highest‑quality SFT data for the final model.

2.2 Reward Modeling

Two reward models are trained:

Rule‑based RM: deterministic checks (e.g., exact answer format for math, test‑case execution for coding).

Model‑based RM: initialized from the SFT checkpoint and trained on preference data that include reasoning chains, which mitigates reward hacking on open‑ended tasks.
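A rule-based reward for math can be as simple as a format check followed by an exact match. The sketch below assumes the final answer is wrapped in `\boxed{...}` and that a plain string comparison is enough; both are illustrative choices, and `math_rule_reward` is a hypothetical name.

```python
import re


def math_rule_reward(response: str, reference: str) -> float:
    """Return 1.0 if the last boxed answer matches the reference exactly, else 0.0."""
    # Assumption: answers are reported as \boxed{...} without nested braces.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    if not matches:
        return 0.0  # wrong format: no verifiable answer found
    return 1.0 if matches[-1].strip() == reference.strip() else 0.0


reward_correct = math_rule_reward(r"Adding the terms gives \boxed{42}.", "42")
reward_wrong = math_rule_reward(r"Adding the terms gives \boxed{41}.", "42")
reward_unformatted = math_rule_reward("The answer is 42.", "42")
```

Because the check is deterministic, the policy cannot raise this reward except by producing a correctly formatted, correct answer, which is why rule-based rewards are preferred wherever they apply.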

2.3 Self‑Rewarding

DeepSeek‑V3 adopts a Constitutional‑AI‑style approach in which the model’s own voting results serve as the feedback signal, which improves scores on subjective evaluations.

2.4 GRPO

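GRPO is a simplified PPO variant that removes the value model and instead estimates the baseline from the rewards of multiple responses sampled for the same prompt; each response’s advantage is its reward relative to that group. The algorithm is described in the DeepSeek‑V2 report.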
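The group-relative baseline can be sketched in a few lines: each response's advantage is its reward minus the group mean, scaled by the group standard deviation. Using the population standard deviation and the small `eps` term are assumptions of this sketch.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Advantages for one prompt's group of sampled responses, with no value model."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population standard deviation
    # eps avoids division by zero when every response receives the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Four responses to one prompt, scored by a reward model or rule-based check.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# All responses equally good: no learning signal for this prompt.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

A group in which every response earns the same reward yields zero advantages, so such prompts contribute nothing to the policy gradient.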

3 TÜLU 3

paper: https://allenai.org/tulu

TÜLU 3’s pipeline is SFT → DPO → RLVR (RL with verifiable rewards). The report provides extensive RL details.

3.1 SFT

Data sources include WildChat, OpenAssistant, NoRobots, FLAN‑v2, and high‑quality closed‑source responses. For prompts that come without a response, the response is distilled from GPT‑4o.

Hyper‑parameters:

Batch size: 128.

Maximum sequence length: 4096.

Learning rate: 5e‑6 for 8B models, 2e‑6 for 70B models.

Two epochs of training.

Batch‑aggregation trick – use sum loss (instead of mean loss) so each token receives equal weight, which improves performance under gradient accumulation.
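A small numeric example shows why the aggregation matters under gradient accumulation. When each micro-batch takes its own mean and the means are then averaged, tokens in short micro-batches count more than tokens in long ones. The per-token losses below are made up, and dividing the summed loss by the total token count is done here only to make the two numbers comparable; the source says only that a sum loss is used.

```python
from typing import List


def mean_of_means(micro_batches: List[List[float]]) -> float:
    """Per-micro-batch mean loss, averaged across accumulation steps."""
    return sum(sum(mb) / len(mb) for mb in micro_batches) / len(micro_batches)


def token_weighted(micro_batches: List[List[float]]) -> float:
    """Sum per-token losses over all micro-batches, then normalize once."""
    total_tokens = sum(len(mb) for mb in micro_batches)
    return sum(sum(mb) for mb in micro_batches) / total_tokens


# Two accumulation steps: 2 tokens with loss 2.0 each, 8 tokens with loss 1.0 each.
micro_batches = [[2.0, 2.0], [1.0] * 8]

biased = mean_of_means(micro_batches)     # short micro-batch gets half the weight
balanced = token_weighted(micro_batches)  # every token gets weight 1/10
```

Here the mean-of-means gives each of the two short-batch tokens four times the weight of a long-batch token, whereas the summed loss treats all ten tokens equally.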

3.2 Preference Finetuning (DPO)

Experiments compare standard DPO, length‑normalized DPO, SimPO, and PPO. Length‑normalized DPO yields the best results; SimPO underperforms the SFT baseline.
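Length-normalized DPO differs from standard DPO in that each log-probability ratio is divided by the length of its response before the margin is formed. The per-pair sketch below illustrates this; the default `beta` is a placeholder, not a value taken from the report.

```python
import math


def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def length_normalized_dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    chosen_len: int,
    rejected_len: int,
    beta: float = 1.0,  # placeholder value for illustration
) -> float:
    """DPO loss for one pair with log-ratios normalized by response length in tokens."""
    chosen_ratio = (policy_chosen_logp - ref_chosen_logp) / chosen_len
    rejected_ratio = (policy_rejected_logp - ref_rejected_logp) / rejected_len
    margin = beta * (chosen_ratio - rejected_ratio)
    return _softplus(-margin)  # equals -log(sigmoid(margin))


# Chosen log-ratio +2 over 4 tokens, rejected log-ratio -2 over 2 tokens.
loss = length_normalized_dpo_loss(-8.0, -7.0, -10.0, -5.0, chosen_len=4, rejected_len=2)
```

Normalizing by length removes the incentive to widen the margin simply by making the chosen response longer.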

Preference data mix includes:

SFT prompts.

WildChat and Persona‑IF prompts.

On‑policy (model‑sampled) and off‑policy (human‑annotated) completions.

Human judges and GPT‑4o as an LLM judge.

Key findings:

Increasing the number of unique prompts improves downstream DPO performance; simple prompt duplication harms it.

On‑policy data (pairs sampled from the current model) outperform off‑policy data.

3.3 RLVR (RL with Verifiable Rewards)

RLVR replaces the learned reward with a rule‑based, verifiable one, but still trains with PPO, using a value model initialized from a general RM. Combining verifiable rewards with RM‑based rewards degrades performance, so only the verifiable reward is used during PPO.

4 Qwen 2.5

paper: https://arxiv.org/abs/2412.15115

Qwen 2.5 follows the sequence SFT → DPO → GRPO.

4.1 SFT

A 1 M‑scale instruction dataset is constructed with a maximum sequence length of 32 K. Training runs for two epochs.

Training hyper‑parameters:

Learning rate decays from 7×10⁻⁶ to 7×10⁻⁷.

Weight decay: 0.1.

Gradient norm clipping: 1.
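The learning-rate decay listed above can be sketched as a simple schedule. The source gives only the start and end values, so the linear shape and the function name `linear_decay_lr` are assumptions of this example.

```python
def linear_decay_lr(
    step: int,
    total_steps: int,
    lr_start: float = 7e-6,
    lr_end: float = 7e-7,
) -> float:
    """Learning rate at a given step, decaying linearly from lr_start to lr_end."""
    if total_steps < 2:
        raise ValueError("total_steps must be at least 2")
    # Clamp the step so out-of-range values map to the schedule's endpoints.
    clamped = min(max(step, 0), total_steps - 1)
    fraction = clamped / (total_steps - 1)
    return lr_start + fraction * (lr_end - lr_start)


first_lr = linear_decay_lr(0, total_steps=1000)
last_lr = linear_decay_lr(999, total_steps=1000)
```

A cosine or exponential decay between the same two endpoints would be equally consistent with what the source states.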

4.2 DPO

Rule‑based data are generated by sampling responses to new prompts with the SFT model, then filtering them via execution feedback or answer matching; human review is also applied.

Training uses 150 000 preference pairs with standard DPO and the Online Merging Optimizer (learning rate 7×10⁻⁷) for one epoch.

Online Merging Optimizer: https://arxiv.org/abs/2405.17931

4.3 GRPO

GRPO follows the same principle as in DeepSeek‑V3. Prompt ordering during training is determined by the variance of the RM reward scores: prompts with higher variance are processed first.

GRPO hyper‑parameters:

Eight reward samples per query.

Global batch size: 2048.

2048 samples per episode.
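The variance-based prompt ordering described in this section can be sketched as a sort over per-prompt reward samples. The prompt names and reward values below are invented, and the use of population variance is an assumption.

```python
import statistics
from typing import Dict, List, Sequence


def order_by_reward_variance(prompt_rewards: Dict[str, Sequence[float]]) -> List[str]:
    """Order prompts so that those with the highest reward variance come first."""
    return sorted(
        prompt_rewards,
        key=lambda prompt: statistics.pvariance(prompt_rewards[prompt]),
        reverse=True,
    )


# Hypothetical RM scores for sampled responses to three prompts.
prompt_rewards = {
    "easy": [0.9, 0.9, 0.9, 0.9],
    "mixed": [0.1, 0.9, 0.2, 0.8],
    "hard": [0.1, 0.1, 0.2, 0.1],
}
ordering = order_by_reward_variance(prompt_rewards)
```

High-variance prompts are the ones where the model's responses disagree most in quality, so they carry the strongest group-relative learning signal.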

