What Do Leading Open‑Source LLMs Do After Pretraining? A Deep Dive into Post‑Training Strategies
This article surveys the post‑training pipelines of major open‑source large language models released this year, detailing their alignment algorithms, data synthesis, reward modeling, DPO/GRPO variants, long‑context handling, tool use, and model‑averaging techniques, and highlights emerging trends such as data‑centric pipelines and iterative weak‑to‑strong alignment.
Introduction
The industry has open‑sourced several high‑performance large language models (LLMs) and published technical reports describing their post‑training (fine‑tuning, alignment, and data‑processing) stages. This article compiles the post‑training recipes of the most prominent open‑source LLMs, focusing on training algorithms and data handling.
Llama 3 (Meta)
Qwen 2 (Alibaba Cloud)
Nemotron 4 (NVIDIA)
AFM (Apple)
Yi (01‑AI)
GLM‑4 (Zhipu AI)
Gemma 2 (Google/DeepMind)
DeepSeek‑V2 (DeepSeek)
Baichuan 2 Alignment (Baichuan)
Key observations from the reports:
Data synthesis has become the dominant post‑training paradigm; rapid development of synthesis pipelines is critical for staying ahead.
LLM‑as‑judge and rejection sampling are widely used for preference data construction (e.g., Llama 3, Qwen 2, Baichuan 2, AFM).
The InsTag instruction-tagging method (originally proposed in the Qwen team's work) appears in the Llama 3, Qwen 2, and Yi reports.
Specialized capabilities—code, multilingual, math/reasoning, long‑context, tool use, instruction following—require dedicated optimization.
Model averaging (training multiple checkpoints with different data or hyper‑parameters and averaging weights) improves performance balance (used by Llama 3, Gemma 2, Baichuan 2).
Reinforcement learning is applied mainly via improved DPO variants; PPO is rarely used due to higher engineering overhead.
Preference‑Alignment Techniques (Table Summary)
Llama 3: Iterative DPO
Qwen 2, Yi‑Lightning: Offline DPO + Online DPO
ChatGLM‑4: DPO + PPO
DeepSeek‑V2, Baichuan 2: GRPO
Nemotron‑4: Iterative DPO + RPO
AFM: Combination of RS, DPO, IPO, and a modified online RL called MDLOO
1. Llama 3
Algorithm
Llama 3 performs several post‑training rounds, each consisting of Supervised Fine‑Tuning (SFT) followed by Direct Preference Optimization (DPO). The pipeline starts from a pretrained checkpoint, trains a Reward Model (RM) on human‑annotated preference data, uses the RM for rejection sampling, fine‑tunes the model with the sampled data, and finally runs DPO to align with human preferences. The process repeats for six iterations, continually collecting new preference and SFT data.
Dialogue Format
Llama 3 introduces a multi‑message chat protocol with special header tokens (indicating speaker and role) and termination tokens (signalling turn switches), enabling new capabilities such as tool use.
Reward Modeling (RM)
Beyond binary chosen/rejected pairs, Llama 3 adds a third “edited” response that is an improvement over the chosen one. The ranking order is edited > chosen > rejected. During training, prompts and multiple responses are concatenated into a single line and shuffled to approximate the standard per‑response scoring setup.
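The three-way ranking can be exploited with a standard pairwise Bradley-Terry loss over all ordered pairs. The sketch below is illustrative only; the exact objective and any margin terms Llama 3 uses are not spelled out here.

```python
import torch
import torch.nn.functional as F

def preference_rm_loss(score_edited, score_chosen, score_rejected):
    """Pairwise reward-model loss over the ranking edited > chosen > rejected.
    Each argument is a tensor of scalar RM scores for one batch.
    -log sigmoid(better - worse) is written as softplus(worse - better)."""
    pairs = [
        (score_edited, score_chosen),
        (score_chosen, score_rejected),
        (score_edited, score_rejected),
    ]
    losses = [F.softplus(worse - better) for better, worse in pairs]
    return torch.stack(losses).mean()
```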
SFT
Rejection‑sampled data (generated by the best checkpoint of the previous round) is mixed with other sources (including synthetic data). The largest model uses a learning rate of 1e‑5 and 8.5k–9k training steps per round.
DPO Modifications
Mask formatting tokens (header and termination tokens) in the DPO loss to avoid pathological behavior caused by conflicting gradients.
Add an auxiliary Negative Log‑Likelihood (NLL) regularization term (weight 0.2) to preserve the expected format of chosen responses.
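Both modifications can be sketched as follows. Tensor shapes, the beta value, and the length normalization of the NLL term are illustrative assumptions, not Meta's implementation.

```python
import torch
import torch.nn.functional as F

def masked_logprob(token_logps, format_mask):
    """Sum per-token log-probs while zeroing out formatting tokens
    (header and termination tokens), as described above."""
    return (token_logps * (~format_mask).float()).sum(-1)

def llama3_style_dpo_loss(pi_logps_c, pi_logps_r, ref_logps_c, ref_logps_r,
                          fmt_mask_c, fmt_mask_r, beta=0.1, nll_weight=0.2):
    """DPO loss with formatting tokens masked plus an auxiliary NLL term
    on the chosen response (weight 0.2)."""
    logp_c = masked_logprob(pi_logps_c, fmt_mask_c)
    logp_r = masked_logprob(pi_logps_r, fmt_mask_r)
    ref_c = masked_logprob(ref_logps_c, fmt_mask_c)
    ref_r = masked_logprob(ref_logps_r, fmt_mask_r)
    margin = beta * ((logp_c - ref_c) - (logp_r - ref_r))
    dpo = -F.logsigmoid(margin).mean()
    nll = -logp_c.mean() / pi_logps_c.size(-1)   # length-normalized NLL on chosen responses (assumption)
    return dpo + nll_weight * nll
```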
Model Averaging
Different data mixes or hyper‑parameters are used to train multiple checkpoints; their weights are averaged to obtain a more balanced model (following Izmailov et al., 2019; Wortsman et al., 2022; Li et al., 2022).
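A minimal sketch of uniform checkpoint averaging in PyTorch; this is the generic "model soup" style averaging cited above, not Meta's exact merging code.

```python
import torch

def average_checkpoints(state_dicts):
    """Uniformly average the weights of checkpoints trained with different
    data mixes or hyper-parameters."""
    avg = {}
    for key in state_dicts[0]:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(0)
    return avg

# usage: merged = average_checkpoints([torch.load(p) for p in checkpoint_paths])
```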
Data
The post‑training data consists of human‑annotated preference data, SFT data, and extensive quality‑control pipelines.
Preference Data
Annotators rank two model outputs for each user prompt into four grades (significantly better, better, slightly better, marginal). An additional edit step encourages annotators to improve the chosen response, yielding three‑way rankings (edited > chosen > rejected).
SFT Data Sources
Human‑annotated prompts with rejection‑sampled responses.
Synthetic data targeting specific capabilities (see Section 2.4).
A small amount of manually curated data.
Rejection Sampling
For each human‑written prompt, the current best model samples K (10–30) outputs; the RM selects the best candidate. Later stages add system prompts to steer style, tone, or format.
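A compact sketch of best-of-K selection; `policy.generate` and `reward_model.score` are assumed interfaces for illustration, not a particular library's API.

```python
def rejection_sample(prompt, policy, reward_model, k=20, system_prompt=None):
    """Sample K candidates from the current best model and keep the one the
    reward model scores highest; an optional system prompt steers style."""
    full_prompt = f"{system_prompt}\n{prompt}" if system_prompt else prompt
    candidates = [policy.generate(full_prompt) for _ in range(k)]
    scores = [reward_model.score(prompt, c) for c in candidates]
    best = max(range(k), key=lambda i: scores[i])
    return prompt, candidates[best]
```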
Data Composition
Table 7 (not reproduced) reports statistics of the “usefulness” mixed dataset. SFT and preference data overlap but are curated separately. Section 2.4 describes topic, complexity, and quality classification, and each post‑training round re‑balances these axes.
Data Processing & Quality Control
Because most training data are model‑generated, rigorous cleaning and quality checks are required.
Data Cleaning
Rule‑based removal of over‑apologizing, overused phrases, emojis, and exclamation marks; balancing their proportion in the dataset.
Data Pruning
Topic Classification: A fine‑tuned 8B Llama 3 classifier assigns coarse (e.g., “math reasoning”) and fine (e.g., “geometry”) buckets.
Quality Scoring: Both RM scores and Llama‑based signals are used. Top‑quarter RM scores are considered high‑quality; Llama‑based three‑point scores (accuracy, instruction following, tone) are also applied. Samples flagged high by either RM or Llama are retained.
Difficulty Scoring: Two difficulty metrics are used (Lu et al., 2023 and Liu et al., 2024) based on intent tags and a three‑point difficulty scale.
Semantic Deduplication: RoBERTa embeddings cluster full dialogues; within each cluster, samples are sorted by quality × difficulty, and a greedy selection keeps only those with cosine similarity below a threshold.
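The greedy selection step within one cluster can be sketched as follows, assuming L2-normalized dialogue embeddings so the dot product equals cosine similarity; the threshold value is illustrative.

```python
import numpy as np

def semantic_dedup(embeddings, quality, difficulty, sim_threshold=0.9):
    """Sort samples by quality x difficulty (best first) and keep a sample
    only if its cosine similarity to every already-kept sample stays below
    the threshold."""
    order = np.argsort(-(quality * difficulty))
    kept = []
    for i in order:
        if all(embeddings[i] @ embeddings[j] < sim_threshold for j in kept):
            kept.append(i)
    return kept
```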
Capability‑Specific Augmentation (Section 2.4)
Llama 3 focuses on improving code, multilingual, math/reasoning, long‑context, tool use, and instruction following.
Code
Training Code Experts: Continuous pre‑training on >85% code tokens (≈1 T tokens) following the CodeLlama recipe. The final thousand steps use 16K context length for long‑context fine‑tuning.
Synthetic SFT Data: Three synthesis methods generate >2.7 M code examples, including execution‑feedback loops, problem‑description generation, solution generation, static analysis, unit‑test generation, and error‑feedback self‑correction. Roughly 20 % of initial solutions are erroneous but are corrected through iterative self‑repair.
System‑Prompt‑Guided Rejection Sampling: Specialized system prompts bias the rejection sampler toward readable, well‑documented, and challenging code.
Execution & Model‑as‑Judge Filtering: Binary scores for correctness and style are assigned; only samples scoring 2/2 are kept. Early strict filtering caused regression on challenging benchmarks, so Llama 3 later relaxed the filter for hard examples.
Multilingual
Continuous pre‑training on a multilingual mix (≈90 % multilingual tokens) creates a multilingual expert model, which then follows the same SFT + preference pipeline.
Data composition: 2.4 % human‑annotated, 44.2 % other NLP tasks, 18.8 % rejection‑sampled, 34.6 % translated inference data.
Human annotation, other NLP task data, and rejection‑sampled data are all filtered for quality, language consistency, and safety.
Math & Reasoning
Challenges: lack of prompts, missing reasoning chains, incorrect intermediate steps, and tool integration.
Solutions: mining math data from pre‑training corpora, converting to QA format, manually writing prompts for weak spots, generating step‑by‑step solutions with Llama 3, filtering with step‑wise RM, using Monte‑Carlo Tree Search for hard prompts, and interleaving code execution with reasoning.
Long Context
Final pre‑training expands context from 8K to 128K tokens. Synthetic long‑context data includes multi‑turn QA, long‑document summarization, and code‑repository reasoning. Experiments show that mixing only 0.1 % long‑context synthetic data with short‑context data yields the best performance.
Tool Use
Three tool types are trained: search engine, Python interpreter, and math engine. Zero‑shot tool use is also trained by providing unseen tool definitions in the system prompt.
Tool‑oriented dialogues contain multiple assistant messages; annotators rank or edit the assistant’s response between two tool‑enabled turns.
No rejection sampling is applied for tools because no gains were observed on tool benchmarks.
Zero‑Shot Tool Calls
Single, nested, and parallel function calls are synthesized, partly by mining function calls and their definitions from the Stack dataset (Kocetkov et al., 2022), and are represented as JSON-formatted calls.
2. Qwen 2
Qwen 2’s post‑training improves coding, math, logical reasoning, instruction following, and multilingual understanding while aligning the model with human values. The pipeline emphasizes minimal human annotation through scalable alignment.
Data Construction
Collaborative Annotation: Automatic ontology extraction from large instruction corpora, followed by human refinement.
Instruction Selection: Diversity, semantic richness, complexity, and intent completeness are evaluated to pick representative instructions.
Instruction Evolution: Existing instructions are enriched with additional constraints using Qwen models (Tree‑Instruct).
Human Annotation: Multiple Qwen model generations are ranked by annotators to create demonstration and preference data.
Automatic Data Synthesis
Rejection Sampling: For tasks with clear answers (math, etc.), multiple model outputs are generated and the best are kept.
Execution Feedback: For coding, generated solutions and test cases are compiled and executed; successful runs become demonstrations (see the sketch after this list).
Data Repurposing: Public‑domain literature is repurposed into role‑play data by pairing detailed character profiles with generated dialogues.
Constitutional Feedback: A set of principle‑based prompts guides the model toward harmlessness.
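The execution-feedback step mentioned above can be sketched as a simple pass/fail filter; real pipelines run generated code in a sandbox, and the file-based runner here is purely illustrative.

```python
import os
import subprocess
import tempfile

def passes_execution_feedback(solution_code, test_code, timeout=10):
    """Write the generated solution together with its generated test cases to a
    temporary file, execute it, and keep the sample only if the tests pass."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.remove(path)
```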
SFT
Over 500 k examples covering instruction following, coding, math, logic, role‑play, multilingual, and safety are used. Two epochs with a sequence length of 32 k; learning rate decays from 7e‑6 to 7e‑7; weight decay = 0.1; gradient clipping = 1.0.
RLHF
Two stages: offline DPO on a curated preference set, then online RL using a reward model for real‑time feedback. An online merging optimizer mitigates alignment tax.
3. Nemotron‑4
Reward Model (RM)
Nemotron‑4 collects 10 k human preference pairs (HelpSteer2) and trains a regression‑based RM that predicts the five HelpSteer2 attributes (helpfulness, correctness, coherence, complexity, verbosity). The RM head replaces the final softmax with a linear projection; the attribute scores are aggregated via a weighted sum. This model achieves state‑of‑the‑art results on RewardBench.
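A sketch of such a regression head follows. The pooling position (last token) and the equal aggregation weights are illustrative assumptions; the report uses its own weighting over the attributes.

```python
import torch
import torch.nn as nn

ATTRIBUTES = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]

class RegressionRewardHead(nn.Module):
    """Replace the LM's final softmax with a linear projection from the last
    hidden state to one score per attribute, then collapse the attribute
    scores into a single scalar reward via a weighted sum."""
    def __init__(self, hidden_size, weights=(1.0, 1.0, 1.0, 1.0, 1.0)):
        super().__init__()
        self.proj = nn.Linear(hidden_size, len(ATTRIBUTES))
        self.register_buffer("weights", torch.tensor(weights))

    def forward(self, last_hidden_state):
        attr_scores = self.proj(last_hidden_state[:, -1, :])  # score at the final token
        return attr_scores @ self.weights                     # one scalar reward per sample
```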
Alignment Data
Nemotron‑4’s synthetic data generation (SDG) pipeline creates >98 % synthetic data across five stages: prompt construction, synthetic dialogue generation, synthetic preference creation, iterative weak‑to‑strong alignment, and other sources. Only ~20 k human‑annotated examples are used (10 k for SFT, 10 k for RM/preference).
Prompt Construction
Macro‑topic generation followed by sub‑topic expansion (UltraChat, CAMEL style).
Open‑ended QA prompts, writing prompts, closed‑ended QA prompts, and math/coding prompts are synthesized.
Synthetic Dialogue
Three‑turn dialogues are generated via iterative role‑play; low‑quality polite filler sentences are filtered out, and dialogues are scored with Nemotron‑4‑340B‑Reward, discarding low‑scoring samples.
Synthetic Preference Data
Preference triples (prompt, chosen, rejected) are built from synthetic single‑turn, instruction‑following, two‑turn, and real‑world prompts (ShareGPT, LMSYS, GSM8K, MATH). Multiple random intermediate models generate responses; the best responses (according to the reward model) form the chosen set, while lower‑scoring responses become rejects.
Alignment Algorithms
Supervised fine‑tuning (SFT) is followed by preference fine‑tuning using DPO and a newly proposed Reward‑Aware Preference Optimization (RPO). RPO adds a term that approximates the reward gap between chosen and rejected responses, reducing over‑fitting observed in pure DPO.
\mathcal{L}_{rpo}(x, y_c, y_l) = \mathbb{D} \left[ \beta \log \frac{\pi(y_c \mid x)}{\pi_{ref}(y_c \mid x)} - \beta \log \frac{\pi(y_l \mid x)}{\pi_{ref}(y_l \mid x)} \,\Big\|\, \eta \left( r(x, y_c) - r(x, y_l) \right) \right]
where \mathbb{D}[\cdot \| \cdot] is a distance measure, \eta scales the reward gap, and r(\cdot, \cdot) is the reward model's score.
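To make the objective concrete, here is a minimal PyTorch sketch. The distance measure used below (squared error between the two gaps) is an assumption for illustration, not necessarily the report's exact choice, and all shapes and hyper-parameters are illustrative.

```python
import torch.nn.functional as F

def rpo_loss(logp_chosen, logp_rejected,          # summed token log-probs under the policy
             ref_logp_chosen, ref_logp_rejected,  # the same under the frozen reference policy
             reward_chosen, reward_rejected,      # scalar reward-model scores
             beta=1.0, eta=1.0):
    """Reward-aware preference optimization sketch: push the policy's implicit
    reward gap toward the gap measured by the reward model.
    The squared-error distance is an illustrative assumption."""
    policy_gap = beta * ((logp_chosen - ref_logp_chosen) -
                         (logp_rejected - ref_logp_rejected))
    rm_gap = eta * (reward_chosen - reward_rejected)
    return F.mse_loss(policy_gap, rm_gap)
```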
Iterative Weak‑to‑Strong Alignment
Starting from a weaker model (Mixtral‑8x7B‑Instruct) as a data generator, Nemotron‑4 iteratively trains stronger checkpoints; each iteration’s stronger model generates higher‑quality synthetic data for the next round, creating a self‑reinforcing improvement loop.
4. AFM (Apple)
AFM’s post‑training consists of SFT and RLHF, introducing two novel algorithms: iTeC (Iterative Teaching Committee) rejection‑sampling SFT and MDLOO (Mirror‑Descent‑Based Leave‑One‑Out) RLHF.
Data
Human‑annotated demonstrations and preference feedback covering usefulness, safety, truthfulness, and style.
Synthetic data for math, tool use, and coding generated under strong reward‑model guidance.
Math Synthetic Data
Problem re‑phrasing and reversal to create variant questions.
Problem evolution (depth and breadth) using instruction‑evolution techniques; difficulty levels are assigned and low‑quality samples are filtered.
Tool‑Use Synthetic Data
Function‑call, code‑interpreter, and browsing data are synthesized; multi‑tool and multi‑step scenarios are added.
Code Synthetic Data
Self‑instruct with rejection sampling: start from 71 programming topics, generate problems, unit tests, and multiple solutions; execution‑based filtering retains only solutions that pass all tests (≈12 k high‑quality triples).
SFT
Data quality guards include human ratings, model‑based automatic filters, and embedding‑based deduplication. Hyper‑parameters: AFM‑Server lr = 5e‑6, AFM‑Device lr = 2e‑5, dropout = 0.1. Checkpoint selection uses a best‑of‑N strategy based on reward‑model scores.
RLHF (MDLOO)
The reward model is trained on human preference triples with graded levels (significantly better, better, slightly better, negligible). MDLOO uses a KL‑penalized objective with a Leave‑One‑Out advantage estimator and Mirror‑Descent Policy Optimization instead of PPO.
R(x, y) = r_{\phi}(x, y) - \beta \log \frac{\pi_{\theta}(y \mid x)}{\pi_{ref}(y \mid x)}
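The leave-one-out baseline can be sketched as follows, applied to the KL-shaped rewards R defined above; this is a generic illustration of the estimator, not Apple's implementation.

```python
def leave_one_out_advantages(rewards):
    """For K responses sampled for the same prompt, each response's baseline is
    the mean shaped reward of the other K-1 responses (rewards has shape (K,))."""
    k = rewards.numel()
    total = rewards.sum()
    baselines = (total - rewards) / (k - 1)   # mean of the other samples
    return rewards - baselines
```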
5. Yi (01‑AI)
Yi emphasizes data quality over quantity, following the LIMA and DEITA philosophies. The fine‑tuning set contains <10 k multi‑turn instruction‑response pairs, each iteratively refined.
Data Improvements
Composite instructions derived from WizardLM, with step‑back chain‑of‑thought formatting.
Response format follows a three‑section “intro‑body‑conclusion” style, with bullet‑point bodies.
CoT data uses a “Step‑Back” pattern: high‑level solution first, then detailed reasoning.
Hallucination reduction by ensuring factual consistency with the pre‑training corpus.
Diversity & Mixing
Broad open‑source prompts covering QA, creative writing, dialogue, reasoning, math, coding, safety, and bilingual tasks are collected. An instruction‑tagging system (InsTag) guides balanced sampling across tags and abilities. Grid‑search determines optimal mixing ratios for each capability.
Long‑Context Data
Short data is mixed with synthetic long‑document QA; documents are concatenated, segmented, and QA pairs are generated to encourage retrieval‑style behavior.
Training
Next‑word prediction loss with AdamW (β1 = 0.9, β2 = 0.999, ε = 1e‑8). Sequence length = 4096, batch size = 64, 300 training steps, constant lr = 1e‑5, weight decay = 0.1, gradient clipping = 1.0, NEFTune noise scales (45 for 34B, 5 for 6B).
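The NEFTune noise injection referenced above works roughly as follows; this is a sketch of the published NEFTune formulation applied during SFT only, not Yi's internal code.

```python
import torch

def neftune_noise(embeddings, alpha):
    """Add uniform noise scaled by alpha / sqrt(L * d) to the input embeddings.
    `embeddings` has shape (batch, L, d); alpha is 45 for Yi-34B, 5 for Yi-6B."""
    batch, seq_len, dim = embeddings.shape
    scale = alpha / (seq_len * dim) ** 0.5
    noise = torch.empty_like(embeddings).uniform_(-1.0, 1.0) * scale
    return embeddings + noise
```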
Safety Alignment
A comprehensive safety taxonomy covering ethics, illegal activity, self‑harm, hate speech, privacy, etc., is used to collect targeted unsafe examples; these are mixed with SFT data and evaluated with adversarial prompts.
Long‑Context Window Support
Engineering improvements (communication overlap, sequence parallelism, compression) enable up to 200 k token context without architectural changes (no sparse or sliding‑window attention).
6. GLM‑4 (Zhipu AI)
GLM‑4‑9B is open‑sourced with a focus on alignment via SFT + RLHF. Key architectural choices include bias‑free QKV, RMSNorm, SwiGLU, 2‑D RoPE, and Grouped Query Attention (GQA) to reduce KV‑cache size.
Alignment
Human‑prompted interactions outperform template‑generated data. RLHF mitigates issues such as response refusals, safety problems, mixed bilingual output, and multi‑turn inconsistency.
Post‑Training Techniques
LongAlign: Extends context to 128 k tokens, achieving performance comparable to Claude 2 and GPT‑4‑Turbo‑1106.
ChatGLM‑Math: Self‑critique pipeline for math problem solving.
ChatGLM‑RLHF: Applies PPO and DPO for alignment.
Self‑Contrast: Generates massive positive/negative pairs without human feedback.
AgentTuning: Trains generalized agent abilities using the AgentInstruct dataset.
APAR: Auto‑parallel auto‑regressive decoding for faster hierarchical response generation.
Safety
Each sample is safety‑checked; harmful outputs are removed. A red‑team continuously probes the model, and flagged unsafe Q&A pairs are manually corrected and re‑aligned. Evaluation follows the seven‑dimension framework of Gehman et al., 2023.
7. Gemma 2 (Google/DeepMind)
Gemma 2 combines a sliding‑window local attention (4096 tokens) with global attention (8192 tokens), RMSNorm, GQA (2 groups), and logit soft‑capping (tanh scaling).
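Logit soft-capping is a simple element-wise transform that smoothly squashes logits into a bounded range. A sketch follows; the cap value shown is illustrative (Gemma 2 uses separate caps for attention logits and final logits).

```python
import torch

def soft_cap(logits, cap=50.0):
    """Soft-cap logits into (-cap, cap) via cap * tanh(logits / cap)."""
    return cap * torch.tanh(logits / cap)
```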
Post‑Training Pipeline
SFT on a mix of synthetic English prompts and human‑annotated instruction‑response pairs.
RLHF using a reward model trained on English preference data; the policy follows the same prompts as SFT.
Model averaging across checkpoints improves overall utility while limiting safety and hallucination risks.
Data Mixing & Filtering
Internal and public data extending the dataset from the Gemma 1.1 paper; prompts from LMSYS‑Chat‑1M are used, but their answers are excluded.
Two‑stage filtering removes personal information, unsafe content, and duplicate examples.
SFT Details
Behavior cloning from larger teacher models, followed by on‑policy distillation (student generates responses, teacher provides KL target) to avoid train‑inference mismatch.
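A minimal sketch of the on-policy distillation loss: the student samples the response itself, and the teacher's next-token distribution on those sampled tokens serves as the KL target. Shapes and the reduction scheme are illustrative.

```python
import torch.nn.functional as F

def on_policy_distill_loss(student_logits, teacher_logits):
    """KL(teacher || student) on tokens generated by the student, so training
    and inference see the same token distribution.
    Both logits tensors have shape (batch, seq, vocab)."""
    student_logp = F.log_softmax(student_logits, dim=-1).flatten(0, 1)
    teacher_p = F.softmax(teacher_logits, dim=-1).flatten(0, 1)
    return F.kl_div(student_logp, teacher_p, reduction="batchmean")  # mean per-token KL
```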
RLHF
Uses a reward model an order of magnitude larger than Gemma 1.1’s, with a stronger focus on multi‑turn dialogue.
Model Merging
Weight‑averaging across checkpoints (WARP) yields balanced performance.
8. DeepSeek‑V2
DeepSeek‑V2 trains on ~1.5 M bilingual examples (1.2 M helpfulness‑focused: 31.2 % general language, 46.6 % math, 22.2 % code; plus 0.3 M safety‑focused). SFT runs for 2 epochs (lr = 5e‑6).
Reinforcement Learning
Adopts Group Relative Policy Optimization (GRPO), which replaces PPO’s critic with group‑wise baseline scores. The objective maximizes advantage‑scaled policy ratios while penalizing KL divergence to a reference policy.
J_{GRPO}(\theta) = \mathbb{E}_{q, \{o_i\}} \left[ \frac{1}{G}\sum_{i=1}^{G} \left( \min\left( \frac{\pi_{\theta}(o_i \mid q)}{\pi_{old}(o_i \mid q)} A_i, \text{clip}\left( \frac{\pi_{\theta}(o_i \mid q)}{\pi_{old}(o_i \mid q)}, 1-\epsilon, 1+\epsilon \right) A_i \right) - \beta D_{KL}(\pi_{\theta}\|\pi_{ref}) \right) \right]
The advantage $A_i$ is the group‑standardized reward: $A_i = \frac{r_i - \text{mean}(\{r_1,\dots,r_G\})}{\text{std}(\{r_1,\dots,r_G\})}$.
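A compact sketch of the objective for a single prompt with G sampled responses follows. The hyper-parameter values and the sequence-level (rather than per-token) formulation are illustrative simplifications of the formula above.

```python
import torch

def grpo_loss(logp_new, logp_old, rewards, kl_to_ref, eps=0.2, beta=0.04):
    """GRPO surrogate for one prompt. logp_new / logp_old: sequence log-probs of
    each response under the current and sampling policies, shape (G,);
    rewards: scalar rewards, shape (G,); kl_to_ref: per-response KL estimate
    against the reference policy."""
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)  # group baseline, no critic
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    surrogate = torch.minimum(ratio * advantages, clipped * advantages)
    return -(surrogate - beta * kl_to_ref).mean()   # negate to minimize
```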
Reward Model
Three reward signals are combined: helpfulness RM, safety RM, and rule‑based RM, weighted by coefficients $c_1$, $c_2$, $c_3$.
r_i = c_1 \cdot RM_{helpful}(o_i) + c_2 \cdot RM_{safety}(o_i) + c_3 \cdot RM_{rule}(o_i)
9. Baichuan 2 Alignment (Nova Alignment)
Baichuan 2 introduces Nova Alignment, a three‑stage system: Prompt‑Enhancement System (PAS), SFT, and Preference Training (PT). The focus is on instruction following, math, code, long‑context, and tool use.
Optimization
SFT: lr = 1e‑5, epochs = 2‑6, using packing, multi‑layer gradient checkpointing, and sequence parallelism.
Reward Model: Adds point‑wise MSE loss to the usual pairwise preference loss to better fit absolute scores.
RL: Uses GRPO (Group Relative Policy Optimization) with KL penalties and an additional KL term against the SFT policy to prevent drift.
Prompt‑Enhancement System (PAS)
Automatically generates supplemental prompts that specify application scenarios, user intent expansions, and response format constraints. The system classifies prompts along six dimensions (ability, attribute, domain, language, difficulty, constraint) using a large‑scale classifier fine‑tuned on Baichuan 2‑13B.
Data Pipeline
Prompt selection balances diversity (six dimensions) and quality (clarity, practicality, complexity, novelty) using a multi‑model pairwise ranking framework.
Response construction combines human annotation with rejection sampling; multi‑step tool usage and multi‑turn dialogues are synthesized.
Preference data includes absolute scores for usefulness, safety, and style, as well as pairwise comparisons.
Key Capabilities
Instruction Following: System messages, constraint expansion, response reversal, and textbook‑style prompting.
Math: Prompt collection across K‑12 to university levels, synthetic solution generation with step‑by‑step reasoning, and self‑critique.
Reasoning: Diverse categories (common‑sense, propositional, relational, multi‑step, game theory, adversarial, counter‑factual) with careful difficulty balancing.
Code: Multi‑language generation, static analysis, unit‑test execution, and error‑feedback loops.
Tool Use: Single‑tool, multi‑tool, and multi‑turn scenarios; zero‑shot tool calls are trained via function‑definition prompts.
Overall, the surveyed models share common themes: heavy reliance on data synthesis, iterative weak‑to‑strong alignment loops, reward‑aware preference optimization (DPO, GRPO, RPO), and extensive quality‑control pipelines that combine rule‑based filters, model‑based scoring, and human annotation.