How LightReasoner Lets Small Models Teach Large Models to Reason Efficiently

The LightReasoner paper from the University of Hong Kong shows that small language models can guide large models on critical reasoning steps, achieving up to 90% faster inference and significant accuracy gains across multiple math benchmarks.

Baobao Algorithm Notes

The University of Hong Kong proposes LightReasoner, a framework in which small models teach large models the "key reasoning" steps, boosting efficiency by up to 90%.

Origin: Contrastive Decoding (CD)

LightReasoner builds on Contrastive Decoding, which runs a small "amateur" model alongside a large "expert" model at every inference step and contrasts their predictions to highlight the large model's strengths. CD, however, suffers from low efficiency, offers no way to select which steps matter, and depends on a large size gap between the two models.

Revival: Three Breakthroughs

From manual intervention to autonomous learning: models learn to reinforce their own advantages.

From treating all tokens equally to focusing on critical steps.

From relying on scale differences to exploiting specialized capability differences, for example between models in the Qwen2.5 family.

Core Idea: Post‑training "Third Path"

Instead of "full review" or "error correction", LightReasoner adopts a third approach—"advantage mining"—which strengthens already‑strong reasoning steps rather than re‑teaching everything.

Technical Insight: KL‑Divergence Bottlenecks

By measuring KL divergence between expert and amateur models at each token, the authors find that only ~20% of tokens show high divergence, corresponding to arithmetic or logical pivots, while ~60% have near‑zero divergence, indicating redundancy.
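The divergence measurement above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: `token_kl` computes KL(expert ‖ amateur) over a next-token probability vector at one decoding step, and the two example distributions are invented to show why a "pivot" token diverges while a routine token does not.

```python
import math

def token_kl(p_expert, p_amateur, eps=1e-12):
    """KL(expert || amateur) at one decoding step, given next-token probability vectors."""
    return sum(p * (math.log(p + eps) - math.log(q + eps))
               for p, q in zip(p_expert, p_amateur))

# Toy 4-token vocabulary: at a reasoning "pivot" the expert is confident
# where the amateur is near-uniform; at a routine token the two agree.
pivot_kl = token_kl([0.85, 0.05, 0.05, 0.05], [0.25, 0.25, 0.25, 0.25])
routine_kl = token_kl([0.70, 0.10, 0.10, 0.10], [0.70, 0.10, 0.10, 0.10])
```

Thresholding this per-token KL is what lets the method spend effort on the ~20% of genuinely informative steps and skip the rest.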

Method: Advantage Distillation

Stage 1 – Contrastive Sampling

Identify informative steps using a KL‑threshold β.

Construct contrastive labels by filtering low‑probability tail tokens and computing log(π_expert) - log(π_amateur) as a score of expert advantage.
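Stage 1 can be sketched as follows, under stated assumptions: `beta` and `tail` are illustrative hyperparameter values (the paper's actual settings may differ), and the tail filter here simply drops tokens below an absolute probability cutoff before softmax-normalizing the expert-advantage scores into a target distribution.

```python
import math

def contrastive_target(p_expert, p_amateur, beta=0.1, tail=0.05, eps=1e-12):
    """Stage 1 sketch: keep a step only if KL(expert || amateur) exceeds beta,
    drop the expert's low-probability tail, then normalize the advantage
    scores log(pi_expert) - log(pi_amateur) into a contrastive target."""
    kl = sum(p * (math.log(p + eps) - math.log(q + eps))
             for p, q in zip(p_expert, p_amateur))
    if kl <= beta:
        return None  # uninformative step: no supervision signal extracted
    scores = [math.log(p + eps) - math.log(q + eps) if p >= tail else float("-inf")
              for p, q in zip(p_expert, p_amateur)]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]  # softmax over surviving tokens
    total = sum(weights)
    return [w / total for w in weights]
```

Returning `None` for low-KL steps is what makes the pipeline cheap: only the minority of high-divergence steps ever produce training examples.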

Stage 2 – Self‑Distillation

Minimize KL divergence between the expert model’s output distribution and the contrastive target, encouraging higher confidence on advantageous tokens while suppressing predictions similar to the amateur model.
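A minimal sketch of the Stage 2 objective, assuming the contrastive target from Stage 1 is given as a probability vector: the loss is KL(target ‖ expert), so gradient descent on it raises the expert's confidence on advantageous tokens and suppresses amateur-like mass. This is a didactic loss function, not the paper's training code.

```python
import math

def self_distill_loss(expert_logits, target, eps=1e-12):
    """Stage 2 sketch: KL(target || expert) between the contrastive target
    and the expert's softmax output, used as the distillation objective."""
    m = max(expert_logits)
    exps = [math.exp(z - m) for z in expert_logits]
    total = sum(exps)
    p = [e / total for e in exps]  # expert's predicted distribution
    return sum(t * (math.log(t + eps) - math.log(pi + eps))
               for t, pi in zip(target, p))

# High loss: the expert is uniform while the target is peaked on token 0.
loss = self_distill_loss([0.0, 0.0, 0.0], [0.9, 0.05, 0.05])
```

In practice this term would be minimized with an optimizer over the expert's parameters; here only the forward computation is shown.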

Analogy: Instead of replaying an entire chess game, focus on the decisive moves where the amateur errs and the master excels, then practice those critical moves.

Experimental Results

Across 7 math reasoning benchmarks and 5 model families, LightReasoner consistently improves accuracy (e.g., +28.1% GSM8K on Qwen2.5‑Math‑1.5B) and reduces inference cost dramatically (90% less time, 80% fewer sampled tokens, 99% fewer training tokens).

Even models that have undergone extensive instruction tuning still see stable gains, and training on GSM8K alone generalizes to MATH, SVAMP, Minerva Math, and MMLU STEM, indicating learned generic reasoning ability.

Implications

Shifts focus from sheer scale to specialized knowledge differences.

Moves supervision from static answer labels to dynamic behavior comparison.

Opens the path toward collaborative model ecosystems where models teach each other.

LightReasoner’s code is open‑source (https://github.com/HKUDS/LightReasoner), inviting further exploration of this efficient, behavior‑driven training paradigm.

Tags: large language models · model distillation · KL divergence · mathematical reasoning · contrastive decoding
Written by Baobao Algorithm Notes, author of the BaiMian large model, offering technology and industry insights.
