How RL Fine‑Tuning Turned Qwen‑7B into a Logic‑Reasoning Powerhouse

By applying a three‑stage rule‑based reinforcement learning pipeline to the open‑source Qwen‑7B model, the authors achieved a jump from 0.2 to 0.41 accuracy on a synthetic logic‑puzzle test set, while also inducing multi‑step reasoning, verification, and longer, more structured responses.

We open‑source the full training logs and Weights & Biases (wandb) reports for a reinforcement‑learning (RL) fine‑tuning project on the Qwen‑7B base model. The code and data are available at https://github.com/Unakar/Logic-RL.

Key Results

The base Qwen‑7B model initially solved only trivial step‑by‑step logic puzzles (≈20% accuracy). After three stages of rule‑based RL (≈400 training steps) without long‑chain‑of‑thought pre‑training, the model learned to:

Mark uncertain steps for later verification (hesitation).

Explore multiple reasoning paths.

Re‑analyze previous statements.

Provide staged summaries.

Verify all statements before producing the final answer.

Occasionally think in Chinese while answering in English.

Test‑set accuracy increased from 0.20 to 0.41, surpassing the GPT‑4o benchmark of 0.30. Average response length grew from ~400 tokens to ~650 tokens due to the added verification steps.

Reward Design

Two strict rewards were used:

Format reward: penalizes any deviation from the required <think>…</think><answer>…</answer> markup.

Answer reward: gives a positive signal only when the final answer is correct.

The reward signal consisted solely of these components, implemented with extensive if‑else logic and regular‑expression checks. Early loopholes discovered by the model were closed by iteratively refining the rules.
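A minimal sketch of such a rule-based reward is shown below. The penalty magnitudes, the regex, and the function name are assumptions for illustration; the actual Logic-RL implementation uses far more extensive if-else logic and regex checks, refined iteratively to close reward-hacking loopholes.

```python
import re

# Strict format check: the whole response must be <think>…</think><answer>…</answer>.
FORMAT_RE = re.compile(r"\A<think>.+?</think>\s*<answer>(.+?)</answer>\s*\Z", re.DOTALL)

def rule_based_reward(response: str, gold_answer: str) -> float:
    """Strict format reward plus a correctness-only answer reward (values assumed)."""
    match = FORMAT_RE.match(response.strip())
    if match is None:
        return -1.0                                   # heavy penalty for any format deviation
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == gold_answer.strip().lower() else -0.5

# Example:
# rule_based_reward("<think>Check each claim…</think><answer>Carol</answer>", "Carol")  # -> 1.0
```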

Dataset Synthesis

The training set contains fewer than 2 000 synthetic logic puzzles generated programmatically to be out‑of‑distribution for Qwen‑7B. Each puzzle follows an “honest person vs. liar” structure with N participants; the task is to identify the liar. Difficulty is adjustable, and a 5‑person version was chosen because Qwen‑7B fails on it initially but can improve with RL.
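As a rough sketch of this kind of synthesis, the generator below produces single-liar puzzles and keeps only those with a unique solution. The names, statement templates, and uniqueness filter are assumptions; the actual Logic-RL data is a richer knights-and-knaves family with tunable difficulty.

```python
import random

NAMES = ["Alice", "Bob", "Carol", "Dave", "Eve"]

def consistent(statements, candidate):
    """Is `candidate` being the sole liar consistent with every statement?"""
    for speaker, target, says_liar in statements:
        fact = (target == candidate)              # is "target is a liar" actually true?
        if speaker == candidate:
            if says_liar == fact:                 # the liar must state a falsehood
                return False
        elif says_liar != fact:                   # honest speakers must state the truth
            return False
    return True

def make_puzzle(n=5, seed=None):
    """Generate one n-person puzzle whose liar is uniquely identifiable."""
    rng = random.Random(seed)
    people = NAMES[:n]
    while True:
        liar = rng.choice(people)
        statements = []
        for speaker in people:
            target = rng.choice([p for p in people if p != speaker])
            fact = (target == liar)
            says_liar = fact if speaker != liar else not fact
            statements.append((speaker, target, says_liar))
        # keep only puzzles where exactly one candidate liar fits all statements
        if sum(consistent(statements, c) for c in people) == 1:
            lines = [f"{s} says: {t} is {'a liar' if v else 'honest'}." for s, t, v in statements]
            return {"puzzle": lines, "answer": liar}
```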

Model Base Selection

Initial experiments with DeepSeek distillation models showed poor instruction following, a tendency to output Python code, and heavy markdown bias. These issues led to the selection of Qwen‑7B as the base model.

RL Training Settings

Batch size: 8.

Rollout sizes progressed from 32 to 64 and finally 16.

Four NVIDIA A100 GPUs were used.

Initial PPO training was stable but slow; the REINFORCE algorithm proved faster and more memory‑efficient.
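Collected in one place, these settings map roughly to a configuration like the sketch below; the field names are illustrative and do not reflect the repo's actual config schema.

```python
# Illustrative configuration reflecting the settings listed above.
# Keys are hypothetical; the Logic-RL repo defines its own config format.
rl_config = {
    "base_model": "Qwen-7B",
    "algorithm": "REINFORCE",      # PPO was stable but slower and more memory-hungry
    "train_batch_size": 8,
    "rollouts_per_prompt": 16,     # earlier runs used 32 and 64
    "num_gpus": 4,                 # NVIDIA A100
    "rewards": ["format_reward", "answer_reward"],
}
```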

Three‑Stage RL Procedure

Step 1: Curriculum Learning & Format Enforcement

Training started with a small 3‑person puzzle set to teach the <think> and <answer> tags. Heavy negative rewards forced format errors below 0.1% within ten training steps. Response length increased modestly at this stage, though the growth was largely superficial rather than a sign of deeper reasoning.
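As a rough sketch, the curriculum switch can be expressed as a simple schedule; the penalty magnitudes and the advancement threshold below are assumptions, not values reported in the article.

```python
# Hypothetical Step 1 -> Step 2 curriculum schedule (penalties and threshold assumed).
CURRICULUM = [
    {"num_people": 3, "format_penalty": -2.0},   # small set; hammer the <think>/<answer> format
    {"num_people": 5, "format_penalty": -1.0},   # full dataset once formatting is reliable
]

def curriculum_stage(format_error_rate: float) -> dict:
    """Advance to the full 5-person set once format errors are rare (threshold assumed)."""
    return CURRICULUM[0] if format_error_rate > 0.001 else CURRICULUM[1]
```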

Step 2: High‑Temperature Sampling & Massive Rollout

The curriculum expanded to the full 5‑person dataset. Sampling temperature was set around 1.2 (1.5 caused collapse). Top‑p and top‑k were tuned to increase token diversity and disrupt the model’s default markdown style. This stage produced emergent verification behavior even though the dataset contained no explicit verification tokens.
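A sketch of the corresponding sampling settings follows; the temperature comes from the article, while the top_p, top_k, and rollout values are placeholders.

```python
# Step 2 sampling settings: temperature is from the article; the rest are assumed.
sampling_params = {
    "temperature": 1.2,   # 1.5 caused training collapse
    "top_p": 0.95,        # assumed; tuned to increase token diversity
    "top_k": 50,          # assumed; helps break the default markdown style
    "n_rollouts": 64,     # "massive rollout" stage; the article cites sizes of 32/64/16
}
```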

Step 3: Gradual Annealing

Sampling temperature was slowly annealed from 1.2 to 0.9, and the learning rate decayed to 2e-7. The model’s outputs became stable, consistently containing verification, reflection, back‑tracking, and correct formatting.
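A minimal linear-annealing sketch is shown below; the article gives only the temperature endpoints (1.2 → 0.9) and the final learning rate (2e-7), so the starting learning rate and the linear shape are assumptions.

```python
def anneal(step: int, total_steps: int,
           temp_start: float = 1.2, temp_end: float = 0.9,
           lr_start: float = 1e-6, lr_end: float = 2e-7) -> tuple[float, float]:
    """Linearly anneal sampling temperature and learning rate over Step 3.

    Endpoint values for temperature and the final LR follow the article;
    the 1e-6 starting LR and the linear schedule are assumptions.
    """
    frac = min(step / max(total_steps, 1), 1.0)
    temperature = temp_start + (temp_end - temp_start) * frac
    lr = lr_start + (lr_end - lr_start) * frac
    return temperature, lr
```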

Observations & Insights

Mixed‑language thinking (Chinese internal reasoning, English answer) appears beneficial.

Response length growth is a natural consequence of added verification steps.

The learned format and verification abilities emerged without explicit supervision.

Pre‑trained models such as GPT‑4o and Claude Sonnet perform poorly on these puzzles, highlighting the advantage of targeted RL fine‑tuning.

Preliminary tests suggest the reasoning skills may transfer to benchmarks like GSM8K.

Future Work

Planned analyses include interpretability studies of training checkpoints, generation of long‑chain‑of‑thought explanations, and broader generalization experiments on additional reasoning benchmarks.
