Inside Qwen: A Deep Dive into the Large Model’s Source Code

The article provides a comprehensive technical walkthrough of Qwen’s large‑model series, covering data preparation, tokenization, model tweaks, training settings, RLHF pipeline, Code‑Qwen specifics, Qwen2 and Qwen3 architectural changes, scaling‑law experiments, and detailed source‑code analysis with illustrative diagrams.


Overview

Qwen is presented as an open, multimodal series that builds on the latest papers and open‑source solutions, selecting the best components after testing.

Data preparation (iterative approach) – 3 T tokens

Deduplication using MinHash and locality-sensitive hashing (LSH); a small sketch follows this list.

Quality improvement based on traditional ML ideas.

Manual sampling for validation.

Multi‑task fusion to boost zero‑shot and few‑shot performance.
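The article does not reproduce the deduplication code, but a minimal near-duplicate detector in the same spirit, built on the datasketch library, could look like the following (the document keys, whitespace tokenization, and the 0.8 similarity threshold are illustrative choices, not Qwen's actual pipeline):

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    # hash each whitespace token into the MinHash signature
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf-8"))
    return m

docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumped over the lazy dog",
    "c": "a completely unrelated sentence about language models",
}

# LSH index that returns candidates whose estimated Jaccard similarity exceeds 0.8
lsh = MinHashLSH(threshold=0.8, num_perm=128)
signatures = {key: minhash_of(text) for key, text in docs.items()}
for key, sig in signatures.items():
    lsh.insert(key, sig)

print(lsh.query(signatures["a"]))  # expected to contain "a" and likely its near-duplicate "b"
```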

Tokenization and model tweaks

Byte‑Pair Encoding (BPE) via the fast BPE tokenizer.

Rotary Positional Embedding (RoPE); a sketch of the rotation follows this list.

FP32 precision for the RoPE inverse-frequency computation.

Pre-Norm and RMSNorm for improved training stability.

SwiGLU activation (implemented as SiLU, the β=1 version of Swish).
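As a rough illustration of the two RoPE-related points above, rotary position embeddings with the inverse frequencies kept in FP32 can be sketched as follows (function names and shapes are illustrative, not Qwen's actual source):

```python
import torch

def rope_tables(head_dim: int, max_len: int, base: float = 10000.0):
    # inverse frequencies computed in FP32 for numerical precision
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    positions = torch.arange(max_len, dtype=torch.float32)
    freqs = torch.outer(positions, inv_freq)           # (max_len, head_dim / 2)
    return freqs.cos(), freqs.sin()

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # x: (batch, seq_len, num_heads, head_dim); rotate consecutive dimension pairs
    x1, x2 = x[..., ::2], x[..., 1::2]
    cos = cos[: x.shape[1]].unsqueeze(0).unsqueeze(2)  # (1, seq_len, 1, head_dim / 2)
    sin = sin[: x.shape[1]].unsqueeze(0).unsqueeze(2)
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)

cos, sin = rope_tables(head_dim=64, max_len=2048)
q = torch.randn(1, 128, 8, 64)          # (batch, seq_len, num_heads, head_dim)
q_rot = apply_rope(q, cos, sin)
```

The interleaved pairing used here is one of two common RoPE layouts; implementations also differ in where they cast back to lower precision.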

Training configuration

Context length: 2048.

Flash Attention.

AdamW (β1 = 0.9, β2 = 0.95, ε = 10⁻⁸) with a learning-rate schedule that decays from 100% of the peak value to 10% of it; a schedule sketch follows this list.

Batch size: 128.

Training framework: DeepSpeed.
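A minimal sketch of this optimizer setup in PyTorch, assuming a cosine-shaped decay from the peak learning rate down to 10% of it (the peak value, step count, and placeholder model below are illustrative, not Qwen's actual configuration):

```python
import math
import torch

peak_lr, min_ratio, total_steps = 3e-4, 0.10, 100_000   # illustrative values
model = torch.nn.Linear(1024, 1024)                      # placeholder model

optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                              betas=(0.9, 0.95), eps=1e-8)

def cosine_to_floor(step: int) -> float:
    # multiplicative LR factor: 1.0 at the start, decaying to min_ratio at total_steps
    progress = min(step / total_steps, 1.0)
    return min_ratio + (1.0 - min_ratio) * 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=cosine_to_floor)
```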

Reinforcement Learning from Human Feedback (RLHF)

Preference model pre‑training (PMP).

Proximal Policy Optimization (PPO).

KL‑penalty (code and mathematics provided).
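The original article's KL-penalty code is not reproduced in this summary, but a common formulation in PPO-based RLHF shapes the per-token reward by subtracting a KL estimate against the frozen reference (SFT) model; the β value and tensor shapes below are illustrative:

```python
import torch

def kl_shaped_rewards(rm_score: torch.Tensor,
                      logprobs: torch.Tensor,
                      ref_logprobs: torch.Tensor,
                      beta: float = 0.1) -> torch.Tensor:
    # logprobs / ref_logprobs: (batch, seq_len) log-probs of the sampled tokens
    kl = logprobs - ref_logprobs        # per-token KL estimate vs. the reference policy
    rewards = -beta * kl                # penalize drifting away from the reference
    rewards[:, -1] += rm_score          # reward-model score credited at the final token
    return rewards
```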

Code‑Qwen

Mixed text and code data.

Multi‑stage SFT strategy.

Warm-up over the first 3% of training iterations.
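A minimal sketch of such a warm-up factor (the linear ramp and the way it combines with the main schedule are assumptions, not Code-Qwen's exact recipe):

```python
def warmup_factor(step: int, total_steps: int, warmup_frac: float = 0.03) -> float:
    # linear ramp from 0 to 1 over the first 3% of iterations, then constant
    warmup_steps = max(1, int(total_steps * warmup_frac))
    return min(1.0, step / warmup_steps)
```

Multiplying this factor into the main learning-rate schedule (for example via LambdaLR, as in the earlier sketch) gives the warm-up behavior described above.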

Code‑Qwen details (7 B model)

Sequence length: 1024.

Training steps: 50 000.

Qwen2 improvements

Support for 30 languages.

High‑quality code, mathematics, and multilingual data.

Qwen2 MoE framework.

Context length progression: Qwen 2048 → Qwen1.5 4096 → Qwen2 32 768 tokens.

Data augmentation: rejection sampling, execution feedback, data repurposing, constitutional feedback.

RL method switched from PPO to Direct Preference Optimization (DPO); a sketch of the DPO loss follows this list.

Attention changed from multi-head attention (MHA) to grouped-query attention (GQA).
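The standard DPO loss compares the policy's log-probabilities on chosen and rejected responses against a frozen reference model; a minimal sketch (β and tensor shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # log-ratio of the policy vs. the reference on each response
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # push the chosen response to be preferred over the rejected one
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()
```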

Scaling‑law exploration

Investigates the relationship between learning‑rate, batch size, model size, and pre‑training data volume.
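The summary does not reproduce the fitted formulas; scaling-law studies of this kind usually fit a Chinchilla-style parametric loss such as the one below, with the optimal learning rate and batch size then expressed as power laws of the compute budget (this is the generic form, not Qwen's reported fit):

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

where N is the parameter count, D the number of pre-training tokens, and E, A, B, α, β are fitted constants.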

Source‑code analysis: Qwen

The core structure is QWenModel, built from a stack of QWenBlock modules, each of which contains QWenAttention and QWenMLP. QWenBlock holds two RMSNorm modules, initializes FlashSelfAttention, and uses quantize_cache_v() and apply_rotary_emb(). QWenLMHeadModel combines the transformer with lm_head and exposes chat() and chat_stream().
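A simplified, runnable skeleton of this structure is sketched below; the real source adds KV-cache handling (quantize_cache_v), rotary embeddings (apply_rotary_emb), FlashSelfAttention, and the chat()/chat_stream() helpers, and its attention is multi-headed rather than the single-head stand-in used here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        x = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x * self.weight

class QWenAttention(nn.Module):
    # single-head, eager stand-in for the real multi-head / flash-attention module
    def __init__(self, hidden_size: int):
        super().__init__()
        self.qkv = nn.Linear(hidden_size, 3 * hidden_size)
        self.proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
        return self.proj(torch.softmax(scores, dim=-1) @ v)

class QWenMLP(nn.Module):
    # SwiGLU feed-forward: SiLU-gated projection
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size)
        self.up_proj = nn.Linear(hidden_size, intermediate_size)
        self.down_proj = nn.Linear(intermediate_size, hidden_size)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

class QWenBlock(nn.Module):
    # pre-norm residual block: two RMSNorms, attention, then the SwiGLU MLP
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.ln_1 = RMSNorm(hidden_size)
        self.attn = QWenAttention(hidden_size)
        self.ln_2 = RMSNorm(hidden_size)
        self.mlp = QWenMLP(hidden_size, intermediate_size)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        return x + self.mlp(self.ln_2(x))

class QWenModel(nn.Module):
    def __init__(self, vocab_size: int, hidden_size: int, intermediate_size: int, n_layers: int):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, hidden_size)
        self.h = nn.ModuleList(QWenBlock(hidden_size, intermediate_size) for _ in range(n_layers))
        self.ln_f = RMSNorm(hidden_size)

    def forward(self, input_ids):
        x = self.wte(input_ids)
        for block in self.h:
            x = block(x)
        return self.ln_f(x)

class QWenLMHeadModel(nn.Module):
    def __init__(self, vocab_size: int = 1000, hidden_size: int = 64,
                 intermediate_size: int = 256, n_layers: int = 2):
        super().__init__()
        self.transformer = QWenModel(vocab_size, hidden_size, intermediate_size, n_layers)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, input_ids):
        return self.lm_head(self.transformer(input_ids))
```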

Qwen2 source changes

Renames the core block to DecoderLayer and introduces group-wise key/value handling via repeat_kv in eager_attention_forward; the number of key-value groups is computed as config.num_attention_heads // config.num_key_value_heads. The MoE variant adds a load-balancing loss and generic sparse-MoE blocks, along with an SDPA attention implementation.
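The repeat_kv helper is close to the following sketch, which broadcasts each key/value head across its group of query heads (treat this as an approximation of the library code, not a verbatim copy):

```python
import torch

def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    # (batch, num_kv_heads, seq_len, head_dim) -> (batch, num_kv_heads * n_rep, seq_len, head_dim)
    batch, num_kv_heads, seq_len, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(
        batch, num_kv_heads, n_rep, seq_len, head_dim
    )
    return hidden_states.reshape(batch, num_kv_heads * n_rep, seq_len, head_dim)

# n_rep corresponds to config.num_attention_heads // config.num_key_value_heads
```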

Qwen3 preview

Although Qwen3 had not yet been released, the available source shows a structure similar to Qwen2. The attention module adds a sliding_window parameter, enabling a sliding-window mechanism for ultra-long context; a mask sketch follows the points below.

Reduces memory consumption.

Improves computational efficiency, accelerating training and inference.

Supports extremely long context modeling.
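A minimal boolean mask for such a sliding window (window size and usage are illustrative; a real implementation fuses this constraint into the attention kernel rather than materializing the full mask):

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    # position i may attend to position j only if j <= i (causal) and j > i - window,
    # so each token sees at most `window` recent tokens regardless of sequence length
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(seq_len=8, window=4)  # True where attention is allowed
```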

Conclusion

The author praises Qwen’s data‑quality‑first philosophy, thorough data cleaning and balancing, and its steady move toward multimodality, expressing interest in future multimodal source‑code analyses.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Large Language Model, Qwen, tokenization, MoE, RLHF, model architecture, source code
Written by

AI2ML AI to Machine Learning

Original articles on artificial intelligence and machine learning, deep optimization. Less is more, life is simple! Shi Chunqi
