Can a 7B LLM Master Sudoku From Scratch Using Reinforcement Learning?
This article describes how a 7B-parameter language model, fine-tuned with DeepSeek's GRPO reinforcement-learning algorithm and a carefully crafted multi-component reward system, learned to solve Sudoku puzzles without any cold-start data. The run outperformed a comparable 3B model and surfaced key insights for structured reasoning tasks.
Background and Goal
The author investigates whether a large language model (LLM) can acquire the ability to solve Sudoku, a task requiring strict rule adherence, consistent grid formatting, and logical and spatial reasoning, using only reinforcement learning (RL) and no cold-start data.
Challenges of Sudoku for LLMs
Strict rule set: each row, column, and 3×3 box must contain digits 1‑9 without repetition.
Need to maintain a consistent grid format.
Requires step‑by‑step logical inference.
Must understand spatial relationships between cells.
Must produce a correct final solution.
Data Preparation
A 4‑million‑puzzle Kaggle Sudoku dataset was filtered by clue count into four difficulty levels (see the bucketing sketch after this list):
Level 1 (very easy): 50‑81 clues
Level 2 (easy): 40‑49 clues
Level 3 (medium): 30‑39 clues
Level 4 (hard): 17‑29 clues
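The article doesn't show the filtering code, but the bucketing is straightforward. A minimal sketch, assuming each puzzle is an 81-character string with '0' marking empty cells (the function name and thresholds simply mirror the levels above):

```python
def difficulty_level(puzzle: str) -> int:
    """Bucket an 81-character puzzle string by its number of clues.
    Assumes '0' marks an empty cell; any other digit is a given clue."""
    clues = sum(ch != "0" for ch in puzzle)
    if clues >= 50:
        return 1  # very easy
    if clues >= 40:
        return 2  # easy
    if clues >= 30:
        return 3  # medium
    return 4      # hard (17 clues is the proven minimum for a unique solution)
```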
Each puzzle was represented as an 81‑character string and rendered as a grid with row and column separators. Prompt engineering then instructed the model to reason step by step inside a <think> tag and to place the final grid inside an <answer> tag.
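A sketch of the grid rendering and prompt construction; the exact separators and prompt wording here are illustrative guesses, not the author's template:

```python
def format_grid(puzzle: str) -> str:
    """Render an 81-character puzzle string as a 9x9 grid with '|'
    between 3-column blocks and a dashed line between 3-row blocks."""
    lines = []
    for r in range(9):
        row = puzzle[r * 9:(r + 1) * 9]
        blocks = [" ".join(row[c:c + 3]) for c in (0, 3, 6)]
        lines.append(" | ".join(blocks))
        if r in (2, 5):
            lines.append("-" * len(lines[-1]))
    return "\n".join(lines)

PROMPT_TEMPLATE = """Solve this Sudoku puzzle. Empty cells are marked 0.

{grid}

Reason step by step inside <think> </think> tags, then write the
completed grid inside <answer> </answer> tags."""
```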
Training Setup
A focused subset of 400 easy puzzles formed the training set, keeping the run within the memory limits of a single RTX 4090 (24 GB). Two model variants were fine-tuned:
Qwen 2.5 7B Instruct with LoRA rank 16
Qwen 2.5 3B Instruct with LoRA rank 32
Key hyperparameters (a training sketch follows the list):
Batch size: 1
Gradient accumulation steps: 8
Learning rate: 3e‑4 (the "Karpathy constant")
Maximum training steps: 500
Evaluation every 10 steps
Maximum sequence length: 3000 tokens
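The article does not include the training script; the following is a minimal sketch of how this setup could be reproduced with Hugging Face TRL's GRPOTrainer and a PEFT LoRA config. `easy_puzzles` is a hypothetical list of (puzzle, solution) string pairs, `format_grid` and `PROMPT_TEMPLATE` come from the data-preparation sketch above, and the component reward functions are sketched in the Reward Design section below:

```python
import re

from datasets import Dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

# Hypothetical (81-char puzzle, 81-char solution) pairs from the filtered set.
train_dataset = Dataset.from_list([
    {"prompt": PROMPT_TEMPLATE.format(grid=format_grid(p)),
     "puzzle": p, "solution": s}
    for p, s in easy_puzzles
])

def parse_answer_grid(completion: str) -> str:
    """Pull the digits inside <answer> back into an 81-character string
    (an illustrative parser; the author's parsing may differ)."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    digits = "".join(ch for ch in (m.group(1) if m else "") if ch.isdigit())
    return digits[:81].ljust(81, "0")

def sudoku_reward(completions, puzzle, solution, **kwargs):
    """TRL forwards extra dataset columns ('puzzle', 'solution') to reward
    functions and expects one float per completion; here we sum the four
    component rewards sketched in the Reward Design section."""
    scores = []
    for comp, puz, sol in zip(completions, puzzle, solution):
        pred = parse_answer_grid(comp)
        scores.append(format_reward(comp)
                      + grid_structure_reward(comp)
                      + accuracy_reward(pred, puz, sol)
                      + rule_compliance_reward(pred))
    return scores

training_args = GRPOConfig(
    output_dir="sudoku-grpo-7b",
    learning_rate=3e-4,                 # the "Karpathy constant"
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    max_steps=500,
    max_completion_length=3000,
    logging_steps=10,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",
    reward_funcs=[sudoku_reward],
    args=training_args,
    train_dataset=train_dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
```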
Reward Design
The reward function comprised several components:
1. Format‑Compliance Reward
Ensures the model uses the required <think> and <answer> tags and places them in the correct order.
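A minimal sketch of what this check might look like; the weights are illustrative, not the article's values:

```python
import re

def format_reward(completion: str) -> float:
    """Full credit when <think>...</think> is followed by <answer>...</answer>;
    partial credit for each tag pair that at least appears."""
    if re.search(r"<think>.*?</think>\s*<answer>.*?</answer>",
                 completion, re.DOTALL):
        return 1.0
    score = 0.0
    if "<think>" in completion and "</think>" in completion:
        score += 0.25
    if "<answer>" in completion and "</answer>" in completion:
        score += 0.25
    return score
```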
2. Grid‑Structure Reward
Evaluates correct row count, proper separators, and overall grid layout.
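Sketched along the same lines, with row-count, separator, and layout checks whose weights are again illustrative:

```python
import re

def grid_structure_reward(completion: str) -> float:
    """Score how closely the text inside <answer> matches the expected
    layout: nine 9-digit rows, with column and row separators."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if not m:
        return 0.0
    lines = [ln.strip() for ln in m.group(1).splitlines() if ln.strip()]
    digit_rows = [ln for ln in lines if sum(ch.isdigit() for ch in ln) == 9]
    score = 0.5 * min(len(digit_rows), 9) / 9        # correct row count
    if any(set(ln) <= {"-"} for ln in lines):
        score += 0.25                                # row separators present
    if digit_rows and all("|" in ln for ln in digit_rows):
        score += 0.25                                # column separators present
    return score
```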
3. Answer‑Accuracy Rewards
An exact-answer reward (5.0) for a completely correct solution.
A robust partial-credit reward that grants credit for preserving the original clues and for each correctly filled cell.
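Only the 5.0 exact-match value is stated in the article; the per-cell partial weights below are illustrative. The sketch assumes the model's answer has already been parsed back into an 81-character string:

```python
def accuracy_reward(predicted: str, puzzle: str, solution: str) -> float:
    """Exact match earns the full 5.0; otherwise grant small per-cell
    credit for preserved clues and correctly filled blanks."""
    if predicted == solution:
        return 5.0
    score = 0.0
    for pred, clue, sol in zip(predicted, puzzle, solution):
        if clue != "0":
            if pred == clue:        # original clue preserved
                score += 0.02
        elif pred == sol:           # blank cell filled correctly
            score += 0.03
    return score
```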
4. Rule‑Compliance Reward
Penalizes duplicate digits in any row, column, or 3×3 box, granting incremental credit for each satisfied constraint.
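A sketch that grants credit for each of the 27 row, column, and box constraints containing no duplicates, so withholding credit plays the role of the duplicate penalty; the uniform 1/27 weighting is an illustrative choice:

```python
def rule_compliance_reward(predicted: str) -> float:
    """Fraction of the 27 Sudoku constraints (9 rows, 9 columns, 9 boxes)
    whose filled cells contain no duplicate digits."""
    groups = []
    for r in range(9):
        groups.append([r * 9 + c for c in range(9)])                # rows
    for c in range(9):
        groups.append([r * 9 + c for r in range(9)])                # columns
    for br in (0, 3, 6):
        for bc in (0, 3, 6):
            groups.append([(br + dr) * 9 + (bc + dc)
                           for dr in range(3) for dc in range(3)])  # 3x3 boxes
    satisfied = 0
    for g in groups:
        cells = [predicted[i] for i in g if predicted[i] != "0"]
        if len(set(cells)) == len(cells):   # no duplicates among filled cells
            satisfied += 1
    return satisfied / 27.0
```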
These components together guide the model to separate reasoning from the final answer and to respect Sudoku’s structural constraints.
Results
The 7B model demonstrated stable learning:
Consistent completion length (~1000 tokens)
Uniformly formatted answers
Steady increase in all reward metrics
In contrast, the 3B model suffered catastrophic instability (its KL divergence spiked to 80), showed inconsistent performance, and eventually crashed.
Training charts illustrate the divergent trajectories: the pink line (7B) shows smooth progress, while the green line (3B) shows erratic spikes and eventual failure.
Key Insights
Model size matters: a minimum scale is required for stable learning on complex reasoning tasks.
Training stability is a prerequisite for progress; the 7B model’s steady dynamics enabled incremental gains.
Multi‑component rewards provide richer guidance than binary success/failure signals.
RL can teach LLMs structured thinking, enabling them to follow strict formats and logical steps.
Future Directions
Planned next steps include:
Increasing puzzle difficulty to test deeper reasoning.
Scaling compute (larger batches, longer training).
Exploring higher LoRA ranks (e.g., 32 for the 7B model).
Distilling cold‑start data from larger models like DeepSeek R1.
Implementing more sophisticated reward functions (difficulty‑aware scaling, milestone bonuses, minimum reward floor).
Developing richer evaluation metrics beyond solution correctness.
Broader Implications
Teaching LLMs to solve Sudoku showcases how RL can endow models with capabilities useful for:
Code generation with strict syntax.
Step‑by‑step mathematical problem solving.
Scientific reasoning and methodical hypothesis testing.
Formal verification tasks that require rule compliance.
These abilities extend far beyond puzzle solving, pointing toward more reliable AI systems for structured, logical tasks.
Original article: https://hrishbh.com/teaching-language-models-to-solve-sudoku-through-reinforcement-learning/