Can a 7B LLM Master Sudoku From Scratch Using Reinforcement Learning?

This article details how a 7B-parameter language model, fine‑tuned with DeepSeek's GRPO reinforcement‑learning algorithm and a carefully crafted multi‑component reward system, learned to solve Sudoku puzzles without any cold‑start data. It outperformed a comparable 3B model and revealed key insights for structured reasoning tasks.


Background and Goal

The author investigates whether a large language model (LLM) can acquire the ability to solve Sudoku, a task demanding strict rule adherence, consistent grid formatting, and logical and spatial reasoning, using only reinforcement learning (RL) and no cold‑start data.

Challenges of Sudoku for LLMs

Strict rule set: each row, column, and 3×3 box must contain digits 1‑9 without repetition.

Need to maintain a consistent grid format.

Requires step‑by‑step logical inference.

Must understand spatial relationships between cells.

Must produce a correct final solution.

Data Preparation

A 4‑million‑puzzle Kaggle Sudoku dataset was filtered by difficulty:

Level 1 (very easy): 50‑81 clues

Level 2 (easy): 40‑49 clues

Level 3 (medium): 30‑39 clues

Level 4 (hard): 17‑29 clues

Each puzzle was represented as an 81‑character string and converted into a grid with appropriate row/column separators. Prompt engineering then wrapped each puzzle so the model reasons inside a <think> tag and emits the final grid inside an <answer> tag.
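A minimal sketch of this preparation step (helper names, separators, and prompt wording are assumptions, not the post's exact code), using '0' for empty cells in the 81‑character encoding:

    def count_clues(puzzle: str) -> int:
        """Number of given digits in an 81-character puzzle string ('0' = empty)."""
        return 81 - puzzle.count("0")

    def to_grid(puzzle: str) -> str:
        """Render an 81-character string as a 9x9 grid with row/column separators."""
        lines = []
        for r in range(9):
            row = puzzle[r * 9:(r + 1) * 9]
            cells = " | ".join(" ".join(row[c:c + 3]) for c in range(0, 9, 3))
            lines.append(cells)
            if r in (2, 5):                 # horizontal separators between 3x3 bands
                lines.append("-" * len(cells))
        return "\n".join(lines)

    def build_prompt(puzzle: str) -> str:
        """Ask the model to reason in <think> and answer in <answer>."""
        return (
            "Solve this Sudoku puzzle. Reason step by step inside <think>...</think>, "
            "then give the completed grid inside <answer>...</answer>.\n\n"
            + to_grid(puzzle)
        )

Here count_clues supports the difficulty buckets above (e.g., 40‑49 clues maps to Level 2), while build_prompt produces the tagged prompt fed to the model.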

Training Setup

A focused subset of 400 easy puzzles formed the training set to stay within GPU memory limits (RTX 4090 24 GB). Two model variants were fine‑tuned:

Qwen 2.5 7B Instruct with LoRA rank 16

Qwen 2.5 3B Instruct with LoRA rank 32

Key hyper‑parameters (a training‑configuration sketch follows the list):

Batch size: 1

Gradient accumulation steps: 8

Learning rate: 3e‑4 (Karpathy constant)

Maximum training steps: 500

Evaluation every 10 steps

Maximum sequence length: 3000 tokens
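A hedged configuration sketch of this setup, assuming TRL's GRPOTrainer with a PEFT LoRA adapter (argument names follow current trl/peft APIs and may differ across versions; the evaluation‑every‑10‑steps schedule and the dataset/reward objects are omitted or assumed defined elsewhere):

    from peft import LoraConfig
    from trl import GRPOConfig, GRPOTrainer

    # LoRA adapter (rank 16 for the 7B variant, per the article).
    peft_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

    # Training arguments mirroring the article's hyper-parameters.
    args = GRPOConfig(
        output_dir="sudoku-grpo",           # hypothetical output path
        per_device_train_batch_size=1,      # batch size 1
        gradient_accumulation_steps=8,      # effective batch of 8
        learning_rate=3e-4,                 # the "Karpathy constant"
        max_steps=500,                      # maximum training steps
        max_completion_length=3000,         # cap on generated sequence length
    )

    trainer = GRPOTrainer(
        model="Qwen/Qwen2.5-7B-Instruct",
        reward_funcs=reward_functions,      # the components described below
        args=args,
        train_dataset=train_dataset,        # the 400 easy puzzles
        peft_config=peft_config,
    )
    trainer.train()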

Reward Design

The reward function comprised several components:

1. Format‑Compliance Reward

Ensures the model uses the required <think> and <answer> tags and places them in the correct order.
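A sketch of such a check (the reward value is illustrative, and TRL‑style reward functions actually receive batches of completions rather than a single string):

    import re

    def format_reward(completion: str) -> float:
        """Full credit only when <think>...</think> precedes <answer>...</answer>."""
        pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
        return 1.0 if re.search(pattern, completion, re.DOTALL) else 0.0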

2. Grid‑Structure Reward

Evaluates correct row count, proper separators, and overall grid layout.
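A rough illustration of partial credit for grid structure (the 0.5 weights are assumptions, not the post's values):

    def grid_reward(answer_text: str) -> float:
        """Partial credit for a well-formed grid: 9 rows, 9 digits each."""
        rows = [ln for ln in answer_text.splitlines() if any(ch.isdigit() for ch in ln)]
        score = 0.5 if len(rows) == 9 else 0.0          # correct row count
        well_formed = sum(1 for ln in rows if sum(ch.isdigit() for ch in ln) == 9)
        return score + 0.5 * well_formed / 9            # rows with exactly 9 digits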

3. Answer‑Accuracy Rewards

Exact answer reward (value 5.0) for a completely correct solution.

Partial robust reward that grants credit for preserving original clues and for each correctly filled cell, as sketched below.
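The shape of this scheme (a 5.0 exact‑match bonus plus graded partial credit) follows the article, but the per‑cell weights below are assumptions, and all strings are assumed normalized to the 81‑character encoding:

    def accuracy_reward(predicted: str, puzzle: str, solution: str) -> float:
        """Exact-match bonus plus per-cell partial credit over 81-char strings."""
        if predicted == solution:
            return 5.0                      # exact answer reward from the article
        score = 0.0
        for p, q, s in zip(predicted, puzzle, solution):
            if q != "0":                    # original clue: reward preservation
                score += 0.02 if p == q else 0.0
            elif p == s:                    # empty cell filled correctly
                score += 0.03
        return score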

4. Rule‑Compliance Reward

Penalizes duplicate digits in any row, column, or 3×3 box, granting incremental credit for each satisfied constraint.
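A sketch of rule‑compliance scoring that grants incremental credit per duplicate‑free unit (the per‑unit weight is an assumption):

    def rule_reward(predicted: str) -> float:
        """Fraction of the 27 Sudoku constraints (rows, columns, boxes) satisfied."""
        if len(predicted) != 81:
            return 0.0
        grid = [predicted[i * 9:(i + 1) * 9] for i in range(9)]
        cols = ["".join(row[c] for row in grid) for c in range(9)]
        boxes = ["".join(grid[r + dr][c + dc] for dr in range(3) for dc in range(3))
                 for r in (0, 3, 6) for c in (0, 3, 6)]
        units = grid + cols + boxes                     # 27 constraints in total
        ok = sum(1 for u in units if sorted(u) == list("123456789"))
        return ok / 27.0                                # incremental credit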

These components together guide the model to separate reasoning from the final answer and to respect Sudoku’s structural constraints.

Results

The 7B model demonstrated stable learning:

Consistent completion length (~1000 tokens)

Uniformly formatted answers

Steady increase in all reward metrics

In contrast, the 3B model suffered catastrophic instability (KL divergence spiked to 80), inconsistent performance, and eventually crashed.

Training charts in the original post illustrate the divergent trajectories: the pink curve (7B) climbs smoothly, while the green curve (3B) shows erratic spikes before failure.

Key Insights

Model size matters: a minimum scale is required for stable learning on complex reasoning tasks.

Training stability is a prerequisite for progress; the 7B model’s steady dynamics enabled incremental gains.

Multi‑component rewards provide richer guidance than binary success/failure signals.

RL can teach LLMs structured thinking, enabling them to follow strict formats and logical steps.

Future Directions

Planned next steps include:

Increasing puzzle difficulty to test deeper reasoning.

Scaling compute (larger batches, longer training).

Exploring higher LoRA ranks (e.g., 32 for the 7B model).

Distilling cold‑start data from larger models like DeepSeek R1.

Implementing more sophisticated reward functions (difficulty‑aware scaling, milestone bonuses, minimum reward floor).

Developing richer evaluation metrics beyond solution correctness.

Broader Implications

Teaching LLMs to solve Sudoku showcases how RL can endow models with capabilities useful for:

Code generation with strict syntax.

Step‑by‑step mathematical problem solving.

Scientific reasoning and methodical hypothesis testing.

Formal verification tasks that require rule compliance.

These abilities extend far beyond puzzle solving, pointing toward more reliable AI systems for structured, logical tasks.

Source: https://hrishbh.com/teaching-language-models-to-solve-sudoku-through-reinforcement-learning/
Tags: Qwen, GRPO, AI training, Sudoku
Written by Baobao Algorithm Notes, author of the BaiMian large model, offering technology and industry insights.