How Qwen3 Achieves Multi-Stage Pretraining, Long-Context, and Thought-Controlled RL

The article details Qwen3's three‑phase pretraining pipeline, long‑context extensions, a cold‑start long‑chain‑of‑thought dataset, reinforcement‑learning fine‑tuning with custom rewards, and a two‑stage distillation process that yields versatile, thought‑controlled language models.

This document provides a comprehensive technical overview of the Qwen3 large language model, covering its multi‑stage pretraining, post‑training reinforcement learning, reward design, and distillation strategies.

Pretraining Stage

Qwen3 is trained in three sequential phases:

General Phase (S1): Over 30 trillion tokens with a sequence length of 4,096 are used to build broad language ability and world knowledge across 119 languages and dialects.

Reasoning Phase (S2): An additional ~5 trillion high-quality tokens (still at 4,096 sequence length), enriched with STEM, coding, reasoning, and synthetic data, are trained with accelerated learning-rate decay.

Long-Context Phase: A dedicated long-text corpus expands the context window to 32,768 tokens. Approximately 75% of this data ranges from 16,384 to 32,768 tokens; the rest falls between 4,096 and 16,384 tokens. Techniques such as ABF (raising the RoPE base frequency to 1,000,000), YaRN, and Dual Chunk Attention (DCA) quadruple the effective sequence-length capacity at inference time.
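To make the ABF step concrete, the sketch below (Python, with parameter values drawn from the figures above) shows how raising the RoPE base frequency stretches the rotary wavelengths so that positions remain distinguishable over a 32,768-token window; it is an illustration, not Qwen3's actual implementation.

```python
import torch

def rope_angles(seq_len: int, head_dim: int, base: float) -> torch.Tensor:
    """Rotary-position-embedding angles: one row per position, one column per
    frequency (head_dim / 2 frequencies). A larger `base` means longer rotary
    wavelengths, so distant positions stay distinguishable at long context."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    return torch.outer(positions, inv_freq)

# Standard RoPE base (10,000) vs. the ABF setting described above (1,000,000),
# paired with the 4,096 -> 32,768 context extension.
short_ctx = rope_angles(seq_len=4_096, head_dim=128, base=10_000.0)
long_ctx = rope_angles(seq_len=32_768, head_dim=128, base=1_000_000.0)
print(short_ctx.shape, long_ctx.shape)  # torch.Size([4096, 64]) torch.Size([32768, 64])
```

YaRN and DCA then act on top of this at inference time, rescaling frequencies and chunking attention so the model can serve sequences roughly four times longer than those seen during training.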

Scaling laws for optimal learning‑rate schedules and batch sizes are derived from extensive experiments linking architecture, data, and training stage.

Post‑Training Stage

1. Long‑Chain‑of‑Thought Cold Start

A curated dataset spanning mathematics, code, logic, and general STEM problems is built, each item paired with a verified reference answer or test cases. Using Qwen2.5-72B-Instruct, a two-stage filter removes queries that cannot be verified and queries that can be answered correctly without chain-of-thought reasoning.

For each remaining query, QwQ-32B generates N candidate answers; low-quality answers are manually filtered out for incorrect results, repetition, guessing, inconsistency between thought and answer, language mixing, or excessive similarity to validation sets. The refined subset seeds the initial reasoning fine-tuning.
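A minimal sketch of this filtering flow, assuming hypothetical helper predicates that wrap the Qwen2.5-72B-Instruct checks (the actual prompts, flags, and thresholds are not published):

```python
from typing import Callable

Query = dict  # e.g. {"question": ..., "reference_answer": ..., "tests": ...}

def filter_cold_start_queries(
    queries: list[Query],
    is_verifiable: Callable[[Query], bool],         # has a checkable reference answer / test case
    solvable_without_cot: Callable[[Query], bool],  # Qwen2.5-72B-Instruct answers it without CoT
) -> list[Query]:
    """Stage 1: drop unverifiable queries; stage 2: drop queries that need no reasoning."""
    stage1 = [q for q in queries if is_verifiable(q)]
    return [q for q in stage1 if not solvable_without_cot(q)]

def filter_candidate_answers(candidates: list[dict]) -> list[dict]:
    """Drop QwQ-32B candidates flagged for the issues listed above (flag names are illustrative)."""
    bad = {"wrong_answer", "repetition", "guessing", "thought_answer_mismatch",
           "language_mixing", "near_validation_set"}
    return [c for c in candidates if not bad & set(c.get("flags", []))]
```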

2. Reasoning Reinforcement Learning

A set of 3,995 query-verifier pairs is collected, each meeting four criteria: not used in the cold start, learnable, sufficiently challenging, and covering diverse sub-domains. The GRPO algorithm updates the model parameters, using large batch sizes, many rollouts per query, and entropy control to balance exploration and exploitation.
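GRPO replaces a learned value network with group-relative advantages: for each query, several responses are sampled and each response's reward is normalized against its own group. A minimal sketch of that advantage computation (a standard GRPO formulation, not Qwen3's training code):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages for GRPO.

    rewards: (num_queries, group_size) verifier rewards for the sampled responses
    of each query. Each response's advantage is its reward normalized by the mean
    and std of its own group, so no separate value network is needed.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 queries, 4 sampled responses each, binary verifier rewards.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(rewards))
```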

3. Thought‑Mode Fusion

Both "thinking" and "non‑thinking" capabilities are merged to allow dynamic mode switching via special tokens. The chat template inserts /think or /no_think markers in user queries or system messages, guiding the model to produce answers in the corresponding mode. Empty thought blocks are used for non‑thinking samples, and multiple markers may appear in multi‑turn dialogues.

SFT Data Construction: The supervised-fine-tuning set combines "thinking" data (produced by rejection sampling on the cold-start queries with the stage-2 reasoning-RL model) and curated "non-thinking" data covering coding, math, instruction following, multilingual tasks, creative writing, QA, and role-play. Automatic checklists evaluate answer quality, and translation tasks are up-sampled for low-resource languages.

Chat Template Design: The template enforces consistent formatting: /think and /no_think flags in user queries or system messages select the mode, while <think> / </think> tags separate the thought section from the answer. When the model reaches a user-defined thinking-length budget, a stop-thought directive is inserted so it produces a final answer from the reasoning accumulated so far.
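To make the mode switching concrete, the sketch below shows one way such a template could be assembled; the special tokens, flag placement, and stop-thought wording are illustrative placeholders, not Qwen3's actual chat template.

```python
def format_turn(user_msg: str, thinking: bool, reasoning: str = "", answer: str = "") -> str:
    """Build one training turn in the fused thinking/non-thinking format.

    /think and /no_think select the mode; <think>...</think> wraps the reasoning.
    Non-thinking samples keep an empty think block so both modes share one format.
    The <|user|> / <|assistant|> markers stand in for real special tokens.
    """
    flag = "/think" if thinking else "/no_think"
    body = f"<think>{reasoning if thinking else ''}</think>{answer}"
    return f"<|user|>{user_msg} {flag}<|assistant|>{body}"

def maybe_stop_thinking(tokens_generated: int, budget: int) -> str:
    """Once the user-defined thinking budget is reached, inject a stop-thought
    directive (hypothetical wording) so the model answers from its reasoning so far."""
    if tokens_generated >= budget:
        return "\nThinking budget reached; answering directly from the reasoning above.\n</think>"
    return ""
```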

4. General Reinforcement Learning

A reward system spanning more than 20 tasks provides feedback for:

Instruction Following: Accurate compliance with user commands, format, length, and structured output.

Format Adherence: Proper use of the /think and /no_think markers and XML-like tags to delineate reasoning.

Preference Alignment: Enhancing helpfulness, engagement, and style for open-ended queries.

Tool Use (Agency): Correct invocation of external tools through multi-turn interactions.

Domain-Specific Skills: Rewards for tasks such as retrieval-augmented generation.

Reward Types:

Rule-Based Rewards: High-precision evaluation for instruction and format compliance.

Reference-Based Model Rewards: Qwen2.5-72B-Instruct scores model outputs against provided reference answers.

Reference-Free Model Rewards: A learned reward model, trained on human preference data, assigns scalar scores without needing a reference answer.
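A minimal sketch of how these three reward types might be routed per task; the function names, task labels, and routing rules are assumptions for illustration, since the report does not publish its reward-system code.

```python
from typing import Callable, Optional

def compute_reward(
    task: str,
    output: str,
    reference: Optional[str],
    rule_check: Callable[[str, str], float],   # deterministic instruction/format verifier
    ref_judge: Callable[[str, str], float],    # e.g. Qwen2.5-72B-Instruct scoring vs. a reference
    preference_model: Callable[[str], float],  # learned reward model, no reference required
) -> float:
    """Dispatch an output to rule-based, reference-based, or reference-free scoring."""
    if task in {"instruction_following", "format_adherence"}:
        return rule_check(task, output)        # rule-based: high precision, cheap
    if reference is not None:
        return ref_judge(output, reference)    # reference-based model reward
    return preference_model(output)            # reference-free preference reward
```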

5. Strong‑to‑Weak Distillation

The distillation pipeline creates lightweight student models (0.6B, 1.7B, 4B, 8B, 14B dense and a 30B MoE) that inherit the teacher's thought‑switching ability.

Offline Distillation: Student models are trained on teacher outputs generated under both /think and /no_think modes.

Online Distillation: Students generate answers in the two modes; their logits are aligned with teacher logits (from Qwen3-32B or Qwen3-235B-A22B) via KL-divergence minimization.
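A minimal sketch of the online-distillation objective, aligning per-token student distributions with the teacher's through a KL term; this is a standard distillation loss written in PyTorch, not Qwen3's actual code.

```python
import torch
import torch.nn.functional as F

def distillation_kl_loss(student_logits: torch.Tensor,
                         teacher_logits: torch.Tensor,
                         temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student), summed over the vocabulary and averaged over tokens.

    Both tensors have shape (batch, seq_len, vocab_size): the student generates a
    response in /think or /no_think mode, the teacher scores the same tokens, and
    the student is updated so its token distributions match the teacher's.
    """
    t_logprobs = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    kl_per_token = F.kl_div(s_logprobs, t_logprobs, log_target=True,
                            reduction="none").sum(dim=-1)  # (batch, seq_len)
    return kl_per_token.mean() * temperature ** 2

# Usage: the teacher's logits are treated as fixed targets.
# loss = distillation_kl_loss(student_logits, teacher_logits.detach())
```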

[Figure: Qwen3 training pipeline diagram]

Tags: pretraining, distillation, Qwen3, long-context
Written by Baobao Algorithm Notes, author of the BaiMian large model, offering technology and industry insights.
