Boosting Large Language Model Math Reasoning: Mixed Instructions, Synthetic Data, and Training Optimizations

This article presents a technical walkthrough of enhancing mathematical reasoning in large language models: it reviews mainstream model architectures, introduces mixed CoT‑PoT instructions, describes how synthetic data are generated and filtered, and covers multi‑stage training optimizations such as RFT, PPO, and DPO, closing with experimental results and Q&A insights.


1. Large Language Model Overview

The presentation begins with a review of mainstream large language models (LLMs) such as GPT‑3 (175B parameters), Bloom, GLM series, LLaMA, Baichuan, and Qwen. Most adopt a GPT‑style transformer architecture, with variations in parameter counts (e.g., GLM 6B/10B/130B, LLaMA 7B/13B/33B, Qwen 7B/14B/110B). Key architectural components include token vocabularies, transformer layers, multi‑head attention, and feed‑forward networks. Parameter calculations for GPT‑2 (1.3B) illustrate how embedding dimensions and layer counts determine model size.
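As a rough illustration of such a calculation, the sketch below estimates a GPT‑style decoder's parameter count from its embedding dimension, layer count, and FFN width. The configuration values are illustrative assumptions (a 50k vocabulary, 2048‑dim embeddings, 24 layers, 4x FFN expansion), not the exact settings of any model named above.

```python
# Rough parameter count for a GPT-style decoder, ignoring biases and layer norms.
# The configuration below is illustrative, not an exact published config.

def gpt_param_count(vocab_size: int, d_model: int, n_layers: int, d_ff: int) -> int:
    embedding = vocab_size * d_model       # token embedding (often tied with the output head)
    attention = 4 * d_model * d_model      # Q, K, V and output projections per layer
    feed_forward = 2 * d_model * d_ff      # up- and down-projection per layer
    return embedding + n_layers * (attention + feed_forward)

# ~1.3B parameters: embeddings (~103M) + 24 layers x (~16.8M attention + ~33.6M FFN)
print(gpt_param_count(vocab_size=50_257, d_model=2048, n_layers=24, d_ff=8192))
```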

Optimization techniques for the attention mechanism, such as FlashAttention, sparse attention, MQA (multi‑query attention), and GQA (grouped‑query attention), are highlighted, along with positional encoding methods (absolute position embeddings and RoPE).
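For reference, a minimal sketch of rotary position embedding (RoPE) applied to a (seq_len, dim) activation is shown below, using the common interleaved‑pair formulation; it is illustrative rather than any specific model's implementation.

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embedding to x of shape (seq_len, dim); dim must be even."""
    seq_len, dim = x.shape
    # One rotation frequency per pair of channels.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)  # (seq_len, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    # Rotate each (x1, x2) channel pair by its position-dependent angle.
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```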

2. Mixed Instructions (CoT + PoT)

Mathematical problems are split into logical reasoning (handled by Chain‑of‑Thought, CoT) and computational tasks (handled by Program‑of‑Thought, PoT). CoT excels at step‑by‑step reasoning but struggles with complex calculations like integrals or equation solving, where PoT achieves higher accuracy. Conversely, PoT alone lacks transparent reasoning for abstract algebra or geometry. The proposed mixed instruction combines CoT for reasoning and PoT for final computation, mitigating the weaknesses of each approach.
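A hypothetical mixed instruction might interleave the two formats as sketched below; the template and field names are illustrative assumptions, not the exact prompt used in the talk.

```python
# Hypothetical mixed CoT + PoT instruction template; field names are illustrative only.
MIXED_TEMPLATE = (
    "Question: {question}\n\n"
    "Reason step by step (CoT) to set up the solution, then emit a short Python\n"
    "program (PoT) whose printed output is the final answer.\n\n"
    "Reasoning:\n{cot_steps}\n\n"
    "Program:\n{pot_code}\n"
)

example = MIXED_TEMPLATE.format(
    question="Evaluate the definite integral of x**2 from 0 to 3.",
    cot_steps="1. The antiderivative of x**2 is x**3 / 3.\n"
              "2. Evaluate it at the bounds 3 and 0 and subtract.",
    pot_code="print(3**3 / 3 - 0**3 / 3)  # 9.0",
)
print(example)
```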

3. Synthetic Data Generation

Because high‑quality math instruction data are scarce, the authors employ a two‑stage synthetic data pipeline. Seed tasks (curated high‑quality math questions) are expanded using Self‑Instruct techniques. Problems are categorized by sub‑domains (matrix operations, calculus, equations, etc.) and then automatically generated. Simple similarity metrics such as longest common subsequence or Jaccard distance are used for initial filtering.
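A minimal sketch of such similarity filtering is given below, using character n‑gram Jaccard similarity; the n‑gram size and the 0.7 threshold are assumptions, not values reported in the source.

```python
# Minimal sketch of Jaccard-based de-duplication for newly generated problems.

def jaccard(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity over character n-grams."""
    grams_a = {a[i:i + n] for i in range(max(len(a) - n + 1, 1))}
    grams_b = {b[i:i + n] for i in range(max(len(b) - n + 1, 1))}
    return len(grams_a & grams_b) / max(len(grams_a | grams_b), 1)

def filter_generated(candidates: list[str], kept: list[str], threshold: float = 0.7) -> list[str]:
    """Keep a candidate only if it is not too similar to anything already kept."""
    for cand in candidates:
        if all(jaccard(cand, prev) < threshold for prev in kept):
            kept.append(cand)
    return kept
```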

A Reward Model and a Critique Model are trained to score generated instructions and answers. The Reward Model provides relative rankings between candidates, while the Critique Model assigns absolute scores, ensuring both instruction quality and answer correctness.
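In standard formulations (not necessarily the speakers' exact code), the two models are trained with different objectives: a pairwise ranking loss for the Reward Model and an absolute‑score regression for the Critique Model, roughly as sketched below.

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise Bradley-Terry loss: the chosen answer should outscore the rejected one."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()

def critique_regression_loss(pred_score: torch.Tensor, human_score: torch.Tensor) -> torch.Tensor:
    """Absolute scoring: regress toward an annotated quality score (e.g. 1-10)."""
    return F.mse_loss(pred_score, human_score)
```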

4. Training Optimization

The training workflow consists of two stages: rejection‑sampling fine‑tuning (RFT) followed by reinforcement learning (RLHF). In the RFT stage, a smaller LLaMA model is supervised fine‑tuned (SFT) and then used to sample multiple diverse reasoning paths for each problem. These paths are filtered by the Reward and Critique models for quality and diversity, then used to train larger models.
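A minimal sketch of the RFT data‑collection step is shown below; the sampling count, score threshold, and caller‑supplied sampling/scoring callables are all hypothetical, standing in for the SFT'd smaller model and the Critique Model.

```python
from typing import Callable, Dict, List

def collect_rft_paths(
    problems: List[str],
    sample_path: Callable[[str], str],            # e.g. the SFT'd smaller model sampled at temperature > 0
    critique_score: Callable[[str, str], float],  # absolute quality score for (problem, path)
    k: int = 8,
    min_score: float = 7.0,
) -> List[Dict[str, str]]:
    """Sample k reasoning paths per problem and keep only the well-scored ones."""
    kept = []
    for problem in problems:
        for _ in range(k):
            path = sample_path(problem)
            if critique_score(problem, path) >= min_score:
                kept.append({"problem": problem, "path": path})
    return kept
```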

Experiments show that smaller models (7B, 14B) contribute disproportionately to diverse reasoning paths compared to larger 33B models, reducing data generation cost. Diversity selection is performed by measuring similarity between reasoning paths and retaining those with maximal distance.
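One way to realize this is a greedy farthest‑point selection, sketched below reusing the jaccard() helper from the filtering sketch above; the selection budget m is an assumption.

```python
# Greedy max-distance sketch of diversity selection: repeatedly keep the path
# that is least similar to anything already selected.

def select_diverse(paths: list[str], m: int) -> list[str]:
    """Greedily pick up to m reasoning paths with maximal pairwise distance."""
    if not paths:
        return []
    selected = [paths[0]]
    remaining = paths[1:]
    while remaining and len(selected) < m:
        # A candidate's distance is 1 minus its highest similarity to any selected path.
        best = max(remaining, key=lambda p: min(1 - jaccard(p, s) for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected
```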

Subsequent reinforcement phases explore PPO and DPO. DPO pairs high‑scoring (e.g., 9‑point) and low‑scoring (e.g., 2‑point) answers from RFT, but gains are modest because DPO struggles on hard samples where answer quality differences are subtle. Optimizations include focusing DPO on hard samples (multiple reasoning paths with low Critique scores) and applying dynamic loss weighting to the PoT component.
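For context, the standard DPO objective with an optional per‑pair weight (e.g., to up‑weight hard samples or the PoT component) looks roughly like the sketch below; this is the textbook formulation, not the speakers' exact implementation.

```python
from typing import Optional

import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logp: torch.Tensor,
    policy_rejected_logp: torch.Tensor,
    ref_chosen_logp: torch.Tensor,
    ref_rejected_logp: torch.Tensor,
    beta: float = 0.1,
    weight: Optional[torch.Tensor] = None,  # e.g. up-weight hard samples or the PoT span
) -> torch.Tensor:
    """Direct Preference Optimization loss over per-sequence log-probabilities."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    loss = -F.logsigmoid(beta * (chosen_ratio - rejected_ratio))
    if weight is not None:
        loss = loss * weight
    return loss.mean()
```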

5. Evaluation Results

On a held‑out test set, SFT achieves 71% accuracy, while RFT improves to 77%. DPO adds only a slight edge, with a 17% win rate versus a 10% loss rate, indicating limited impact on difficult problems. Hard samples—defined as questions where multiple reasoning paths receive low Critique scores—require more training steps to converge.

6. Q&A Highlights

Key questions addressed include the distinction between PPO and DPO (different loss formulations and data pairing strategies), the scalability limits of synthetic data (dependent on filtering quality), and model size considerations for Critique versus Reward models (Critique remains lightweight, Reward can be larger).

Overall, the work demonstrates that treating mathematical reasoning as a core general capability of LLMs, and systematically improving it through mixed instructions, synthetic data, and multi‑stage training, yields measurable performance gains.
