Can a 7B Language Model Solve Sudoku with Reinforcement Learning? Findings and Lessons
This article details a reinforcement‑learning experiment that trains 7B‑ and 3B‑parameter language models to solve Sudoku with DeepSeek's GRPO algorithm, covering data preparation, multi‑component reward design, training configuration, performance comparisons, key insights, and future research directions.
Background
Recent advances show that large language models (LLMs) can perform many tasks, but teaching them to handle structured, spatial‑reasoning problems like Sudoku remains challenging. Hrishbh Dalal explored whether a 7B‑parameter model could learn Sudoku purely through reinforcement learning (RL) using DeepSeek’s GRPO algorithm.
Data Preparation
The experiment used a Kaggle Sudoku dataset containing 4 million puzzles of varying difficulty. The preparation pipeline involved:
Downloading the dataset with kagglehub and filtering puzzles by difficulty.
Classifying puzzles into four levels: Level 1 (50‑81 clues), Level 2 (40‑49 clues), Level 3 (30‑39 clues), Level 4 (17‑29 clues).
Representing each puzzle as an 81‑character string and converting it into a grid format with proper row, column, and block delimiters.
Designing prompts that instruct the model to reason inside <think> tags and give its final grid inside <answer> tags.
A focused subset of 400 easy puzzles was used for initial training, keeping each example within the 3,000‑token sequence budget that fits on a single 24 GB RTX 4090.
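A minimal sketch of this pipeline; the helper names, delimiter characters, and prompt wording are illustrative assumptions, since the article describes the steps only at a high level:

```python
# Sketch of the data-preparation steps above. An empty cell is '0' in the
# 81-character puzzle string; delimiters and prompt wording are assumptions.

def count_clues(puzzle: str) -> int:
    """Count the given cells in an 81-character puzzle string."""
    return sum(ch != "0" for ch in puzzle)

def difficulty_level(puzzle: str) -> int:
    """Map clue count to the four difficulty levels used in the article."""
    clues = count_clues(puzzle)
    if clues >= 50:
        return 1
    if clues >= 40:
        return 2
    if clues >= 30:
        return 3
    return 4  # 17-29 clues

def to_grid(puzzle: str) -> str:
    """Render the 81-character string as a 9x9 grid with block delimiters."""
    lines = []
    for r in range(9):
        row = puzzle[r * 9:(r + 1) * 9]
        lines.append(" | ".join(" ".join(row[c:c + 3]) for c in (0, 3, 6)))
        if r in (2, 5):  # horizontal separator between 3-row blocks
            lines.append("-" * len(lines[-1]))
    return "\n".join(lines)

PROMPT_TEMPLATE = (
    "Solve this Sudoku puzzle. Reason step by step inside <think> tags, "
    "then give the completed grid inside <answer> tags.\n\n{grid}"
)

def build_prompt(puzzle: str) -> str:
    return PROMPT_TEMPLATE.format(grid=to_grid(puzzle))
```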
Experimental Setup
Two model variants were fine‑tuned with LoRA:
Qwen 2.5 7B Instruct (LoRA rank 16)
Qwen 2.5 3B Instruct (LoRA rank 32)
Training hyper‑parameters:
Batch size: 1
Gradient accumulation steps: 8
Learning rate: 3e‑4 (Karpathy constant)
Maximum rollout steps: 500
Evaluation every 10 steps
Maximum sequence length: 3000 tokens
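The article does not name its training stack. The sketch below assumes TRL's GRPOTrainer with a PEFT LoRA adapter, a common combination for GRPO fine‑tuning; values not listed in the hyper‑parameter table above (lora_alpha, target modules, output directory) are assumptions:

```python
# Hedged sketch of the training setup, assuming TRL + PEFT.
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

lora = LoraConfig(
    r=16,                        # rank 16 for the 7B run (the 3B run used 32)
    lora_alpha=32,               # assumed; not stated in the article
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)

args = GRPOConfig(
    output_dir="qwen-sudoku-grpo",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=3e-4,          # the "Karpathy constant"
    max_steps=500,               # maximum rollout steps
    eval_strategy="steps",
    eval_steps=10,
    max_completion_length=3000,  # 3,000-token sequence budget
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",
    reward_funcs=reward_funcs,    # the reward components described in the next section
    args=args,
    train_dataset=train_dataset,  # the 400-puzzle subset (prompt/puzzle/solution columns)
    peft_config=lora,
)
trainer.train()
```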
Reward Design
The RL reward system consisted of multiple components:
1. Format Compliance Reward
Two functions encourage the required tags and their ordering: tags_presence_reward_func awards credit for each required tag present, and tags_order_reward_func rewards outputs in which <think> appears before <answer>.
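The article names these two functions but not their bodies. A plausible reconstruction, assuming the list-of-completion-strings signature that TRL-style GRPO trainers pass to reward functions; the credit values are assumptions:

```python
import re

REQUIRED_TAGS = ["<think>", "</think>", "<answer>", "</answer>"]

def tags_presence_reward_func(completions, **kwargs):
    """Small credit (assumed 0.25) for each required tag present."""
    return [0.25 * sum(tag in c for tag in REQUIRED_TAGS) for c in completions]

def tags_order_reward_func(completions, **kwargs):
    """Full credit only when a <think> block precedes the <answer> block."""
    pattern = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL)
    return [1.0 if pattern.search(c) else 0.0 for c in completions]
```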
2. Grid Structure Reward
Evaluates whether the model’s answer maintains proper Sudoku grid formatting, rewarding correct rows, separators, and overall layout.
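One way such a check might look (the per-row credit and cap are assumptions): extract the <answer> block, strip separators, and reward each line that reduces to a well-formed nine-digit row:

```python
import re

def grid_structure_reward_func(completions, **kwargs):
    """Reward answers whose lines look like properly formatted Sudoku rows."""
    rewards = []
    for c in completions:
        match = re.search(r"<answer>(.*?)</answer>", c, re.DOTALL)
        if not match:
            rewards.append(0.0)
            continue
        score = 0.0
        for line in match.group(1).strip().splitlines():
            digits = re.sub(r"[^1-9]", "", line)  # drop '|' separators and spaces
            if len(digits) == 9:                  # a well-formed row
                score += 0.1
        rewards.append(min(score, 0.9))           # cap at nine rows
    return rewards
```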
3. Answer Accuracy Reward
Two functions assess solution correctness: exact_answer_reward_func grants a large reward (5.0) for a completely correct grid, while simple_robust_partial_reward_function gives partial credit for preserving the original clues and correctly filling individual cells, providing smoother gradients.
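The function names and the 5.0 exact-match reward come from the article; the bodies below are reconstructions, and the puzzle/solution keyword columns and per-cell credit values are assumptions:

```python
import re

def _extract_answer_digits(completion: str) -> str:
    """Pull the digits of the proposed grid out of the <answer> block."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return re.sub(r"[^0-9]", "", match.group(1)) if match else ""

def exact_answer_reward_func(completions, solution, **kwargs):
    """Large reward (5.0) only for a completely correct 81-cell grid."""
    return [5.0 if _extract_answer_digits(c) == sol else 0.0
            for c, sol in zip(completions, solution)]

def simple_robust_partial_reward_function(completions, puzzle, solution, **kwargs):
    """Partial credit per cell: preserve the original clues and fill empties
    correctly. Smoother than the all-or-nothing exact reward."""
    rewards = []
    for c, puz, sol in zip(completions, puzzle, solution):
        pred = _extract_answer_digits(c)
        if len(pred) != 81:           # malformed grid earns nothing here
            rewards.append(0.0)
            continue
        score = 0.0
        for p, given, s in zip(pred, puz, sol):
            if given != "0":                          # original clue cell
                score += 0.02 if p == given else -0.02  # penalize altered clues
            elif p == s:                              # correctly solved empty cell
                score += 0.03
        rewards.append(score)
    return rewards
```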
4. Rule‑Compliance Reward
Checks each row, column, and 3×3 block for duplicate numbers; satisfying each constraint yields additional reward.
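A sketch of that check, reusing the _extract_answer_digits helper defined with the accuracy rewards (the 0.1-per-constraint credit is an assumption):

```python
def rule_compliance_reward_func(completions, **kwargs):
    """Credit each row, column, and 3x3 block that contains no duplicates."""
    rewards = []
    for c in completions:
        pred = _extract_answer_digits(c)
        if len(pred) != 81:
            rewards.append(0.0)
            continue
        grid = [pred[i * 9:(i + 1) * 9] for i in range(9)]
        units = list(grid)                                            # 9 rows
        units += ["".join(row[j] for row in grid) for j in range(9)]  # 9 columns
        units += ["".join(grid[r + dr][j + dj]                        # 9 blocks
                          for dr in range(3) for dj in range(3))
                  for r in (0, 3, 6) for j in (0, 3, 6)]
        # all nine cells distinct => the constraint is satisfied
        rewards.append(sum(0.1 for u in units if len(set(u)) == 9))
    return rewards
```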
Results
The 7B model demonstrated stable learning:
Consistent rollout length around 1000 tokens.
Generated correctly formatted solutions.
Steady increase in reward metrics.
Stable training dynamics throughout.
In contrast, the 3B model suffered catastrophic instability:
Frequent KL‑divergence spikes up to 80.
Severe performance oscillations and eventual collapse.
Charts (included in the original article) show the 7B curve remaining smooth while the 3B curve diverges dramatically.
Reward trajectories indicate that the 7B model learned to produce exact solutions quickly, whereas the 3B model failed to maintain learning stability.
Key Insights
There appears to be a minimum model scale required for complex reasoning tasks; sub‑threshold models may fail to learn.
Stable training dynamics are a prerequisite for successful learning.
Multi‑component rewards guide the model more effectively than binary success/failure signals.
RL can teach LLMs structured thinking, even though they are originally trained for next‑token prediction.
Future Work
Planned extensions include:
Increasing puzzle difficulty to test reasoning limits.
Scaling up compute resources for longer training and larger batches.
Exploring higher LoRA ranks (e.g., 32 for the 7B model).
Distilling cold‑start data from larger models like DeepSeek R1.
Implementing more sophisticated reward functions that incorporate difficulty weighting, progressive thresholds, and minimum reward floors (a speculative sketch follows this list).
Developing richer evaluation metrics that assess reasoning quality beyond mere correctness.
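None of these reward ideas are implemented yet. A speculative sketch of how difficulty weighting, a progressive threshold, and a minimum reward floor could compose, with all constants invented for illustration:

```python
def shaped_reward(base_reward: float, level: int, step: int) -> float:
    """Speculative shaping for the planned reward functions (values invented)."""
    weight = {1: 1.0, 2: 1.5, 3: 2.0, 4: 3.0}[level]  # harder puzzles pay more
    threshold = min(0.1 + step / 5000, 0.5)           # tightens as training proceeds
    reward = base_reward * weight if base_reward >= threshold else 0.0
    return max(reward, 0.05)                          # floor keeps some signal alive
```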
Broader Implications
Teaching LLMs to solve Sudoku demonstrates capabilities that extend to other domains requiring structured processes, step‑by‑step logic, format consistency, rule compliance, and spatial reasoning, such as code generation, mathematical problem solving, scientific reasoning, and formal verification.
Overall, the experiment shows that a 7B LLM can quickly learn Sudoku with limited data when equipped with a well‑designed multi‑component RL reward system, while smaller models may lack the capacity to achieve stable learning.