What’s New in Qwen2.5? A Deep Dive into the Latest LLM Advances
The Qwen2.5 Technical Report introduces a new series of large language models with up to 72B parameters, a pre‑training corpus expanded to 18 trillion tokens, and advanced supervised fine‑tuning and reinforcement learning pipelines; the models demonstrate strong performance across comprehension, reasoning, coding, and long‑context tasks.
Abstract
Qwen2.5 is a family of large language models (LLMs) that expands the pre‑training corpus from 7 trillion to 18 trillion tokens and introduces a more extensive post‑training pipeline. The enlarged dataset improves commonsense, expert knowledge, and reasoning capabilities. Post‑training adds over one million supervised fine‑tuning samples and a two‑stage reinforcement learning process (offline DPO followed by online GRPO), which enhances human‑preference alignment, long‑text generation, structured data analysis, and instruction following.
Key Features of the Qwen2.5 Series
Rich Configurations: Model sizes range from 0.5B to 72B parameters, with base and instruction‑tuned variants as well as quantized versions.
Performance: Strong results on benchmarks covering language understanding, reasoning, mathematics, coding, and human‑preference alignment.
Model Scale: The 72B instruction‑tuned model matches the performance of much larger models such as Llama‑3‑405B‑Instruct, roughly five times its size.
Architecture and Tokenizer
The series includes dense Transformer models and Mixture‑of‑Experts (MoE) variants (Qwen2.5‑Turbo and Qwen2.5‑Plus) offered through API services. Architectural highlights include grouped‑query attention (GQA), the SwiGLU activation, and rotary positional embeddings (RoPE). The tokenizer is a byte‑level BPE (BBPE) with a vocabulary of 151,643 tokens.
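For readers unfamiliar with SwiGLU, here is a minimal PyTorch sketch of the gated feed‑forward block; the dimensions are illustrative placeholders, not Qwen2.5's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Gated feed-forward block (SwiGLU) used in many modern Transformers.
    dim/hidden below are illustrative, not Qwen2.5's actual sizes."""

    def __init__(self, dim: int = 1024, hidden: int = 2816):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden, bias=False)  # gating branch
        self.up_proj = nn.Linear(dim, hidden, bias=False)    # value branch
        self.down_proj = nn.Linear(hidden, dim, bias=False)  # project back

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU(x) = (SiLU(x @ W_gate) * (x @ W_up)) @ W_down
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

# Smoke test on a dummy batch of 2 sequences x 8 tokens.
out = SwiGLUFeedForward()(torch.randn(2, 8, 1024))
print(out.shape)  # torch.Size([2, 8, 1024])
```

The gating branch modulates the value branch elementwise, which is what distinguishes gated‑linear‑unit MLPs from a plain GELU feed‑forward layer.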
Pre‑training
Data quality improvements involve stricter filtering, integration of mathematics and code data, synthetic data generation, and careful data mixing, bringing the corpus to 18 trillion tokens. A dedicated long‑context training stage raises the RoPE base frequency from 10,000 to 1,000,000, supporting context lengths up to 32,768 tokens. At inference time, YARN (Peng et al., 2023) and Dual Chunk Attention (DCA; An et al., 2024) extend the window further, enabling Qwen2.5‑Turbo to process up to 1 million tokens and the other models up to 131,072 tokens.
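To build intuition for why raising the RoPE base matters, the short sketch below computes the longest rotary wavelength under each base; the head dimension of 128 is an assumption for illustration, not a figure from the report.

```python
import numpy as np

def max_rope_wavelength(dim: int = 128, base: float = 10_000.0) -> float:
    # Standard RoPE: inv_freq_i = base^(-2i/dim) for i = 0 .. dim/2 - 1;
    # the wavelength of dimension pair i, in tokens, is 2*pi / inv_freq_i.
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return float((2 * np.pi / inv_freq).max())

# Head dim 128 is an illustrative assumption, not the reported configuration.
print(f"base 1e4: {max_rope_wavelength(base=1e4):,.0f}-token max wavelength")
print(f"base 1e6: {max_rope_wavelength(base=1e6):,.0f}-token max wavelength")
```

Under these assumptions, the slowest‑rotating dimension pair cycles roughly every 54,000 tokens at base 10,000, but past 5 million tokens at base 1,000,000, so positional phases stay distinguishable across far longer contexts.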
Post‑training
Two major advances are introduced: (1) an expanded supervised fine‑tuning dataset of over one million samples, and (2) a two‑stage reinforcement learning pipeline in which offline Direct Preference Optimization (DPO) is followed by online Group Relative Policy Optimization (GRPO).
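To make the two stages concrete, here is a minimal NumPy sketch under loose assumptions (inputs and the beta hyperparameter are made up for illustration; this is not the report's implementation): offline DPO optimizes a preference margin against a frozen reference model, while online GRPO replaces a learned critic with group‑normalized rewards.

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Offline DPO: raise the policy's log-prob margin on a (chosen, rejected)
    pair above the frozen reference model's margin, scaled by beta."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log(sigmoid(margin))

def grpo_advantages(rewards):
    """Online GRPO: for a group of responses sampled from one prompt,
    normalize each reward by the group mean/std; the normalized score
    serves as the advantage, so no separate value network is needed."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Illustrative values only: per-sequence log-probs and per-sample rewards.
print(dpo_loss(-12.0, -15.0, -13.0, -14.0))   # loss shrinks as margin grows
print(grpo_advantages([0.2, 0.9, 0.5, 0.4]))  # zero-mean group advantages
```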
Evaluation
Qwen2.5 models were benchmarked on tasks spanning natural language understanding, programming, mathematics, and multilingual abilities. Both Qwen2.5‑72B and Qwen2.5‑Plus achieve top‑tier scores, comparable to leading open‑weight models.
Conclusion
Qwen2.5 provides a wide range of configurations, strong benchmark performance, flexible architecture, and extensive accessibility, making it a valuable resource for academic research and industrial applications.
Link: https://arxiv.org/pdf/2412.15115