How Hierarchical Sampling Boosts Self‑Taught Reasoning in LLMs
HS‑STAR introduces a three‑stage hierarchical sampling framework that identifies high‑utility boundary problems, reallocates computation budget to them, and fine‑tunes large language models, achieving significant accuracy gains on math reasoning benchmarks without extra sampling cost.
Project Overview
HS‑STAR (Hierarchical Sampling for Self‑Taught Reasoners) is a three‑stage framework that improves mathematical reasoning of large language models by allocating computation budget according to problem difficulty.
Motivation
Existing self‑taught reasoning methods treat all training questions equally, wasting resources on overly easy or overly hard examples. Empirical studies show that “boundary” problems—those near the model’s capability limit—provide the highest learning utility.
Method
The framework consists of:
Stage 1 – Difficulty Estimation: A lightweight Reward‑Guided Difficulty Estimation (RDE) module runs a small pre‑sampling step, evaluates the sampled responses by answer correctness and process reward, and classifies each question as Inlier, Outlier, or Boundary.
Stage 2 – Re‑Sampling: The remaining budget is re‑allocated exclusively to the identified Boundary questions, generating additional high‑quality responses.
Stage 3 – Preference Optimization: All collected responses are paired as correct/incorrect, ranked by a preference reward model, and used to fine‑tune the LLM via direct preference optimization.
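The three stages can be sketched as a single sampling round. This is a minimal toy illustration, not the paper's implementation: the sampler, process‑reward model, score weighting, and thresholds (`lo`, `hi`, `k_pre`) are all hypothetical stand‑ins, and the output pairs would feed a DPO trainer that is omitted here.

```python
import random

random.seed(0)

# --- Hypothetical stand-ins for the paper's components ------------------------
def sample_responses(question, n):
    """Toy sampler: each response is correct with the question's pass rate."""
    return [(f"{question['id']}-resp", random.random() < question["pass_rate"])
            for _ in range(n)]

def process_reward(_response):
    return random.random()  # placeholder for a learned process-reward model

# --- Stage 1: Reward-Guided Difficulty Estimation (sketch) --------------------
def classify(samples, lo=0.25, hi=0.75):
    """Combine answer accuracy and mean process reward into one difficulty score."""
    acc = sum(ok for _, ok in samples) / len(samples)
    rew = sum(process_reward(r) for r, _ in samples) / len(samples)
    score = 0.5 * acc + 0.5 * rew   # assumed equal weighting of the two signals
    if score >= hi:
        return "inlier"    # reliably solved -> low learning utility
    if score <= lo:
        return "outlier"   # far beyond current ability -> little usable signal
    return "boundary"      # near the capability limit -> high utility

def hs_star_round(questions, total_budget, k_pre=4):
    pools, labels = {}, {}
    # Stage 1: cheap pre-sampling plus difficulty estimation for every question
    for q in questions:
        pools[q["id"]] = sample_responses(q, k_pre)
        labels[q["id"]] = classify(pools[q["id"]])
    # Stage 2: spend the entire remaining budget on Boundary questions only
    boundary = [q for q in questions if labels[q["id"]] == "boundary"]
    remaining = total_budget - k_pre * len(questions)
    if boundary and remaining > 0:
        extra = remaining // len(boundary)
        for q in boundary:
            pools[q["id"]].extend(sample_responses(q, extra))
    # Stage 3: pair correct vs. incorrect responses as DPO preference data
    pairs = []
    for q in questions:
        correct = [r for r, ok in pools[q["id"]] if ok]
        wrong = [r for r, ok in pools[q["id"]] if not ok]
        pairs += [(q["id"], c, w) for c, w in zip(correct, wrong)]
    return labels, pairs
```

Under this sketch, a question answered correctly in nearly every pre‑sample lands in Inlier, one that is almost never solved lands in Outlier, and only the questions in between receive the re‑sampling budget before pair construction.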
Experimental Results
Across several math reasoning benchmarks and three base models (DeepSeek‑Math‑7B, Qwen2.5‑3B, Qwen2.5‑7B), HS‑STAR outperforms all baselines, improving average accuracy by 1.4–2.2%. It also achieves state‑of‑the‑art performance on high‑difficulty datasets such as AIME‑24 and AMC‑23. Zero‑training experiments show results comparable to online RL methods while avoiding their complexity.
Difficulty estimation classifies Inlier/Outlier/Boundary samples with over 70% accuracy across three iterative rounds, and it generalizes to multiple models, even under zero‑training conditions.
Ablation studies confirm that combining accuracy and reward signals in RDE yields the best performance, and that focusing re‑sampling solely on Boundary questions provides the highest gains.
Future Directions
Extending HS‑STAR beyond mathematics to tasks with ambiguous difficulty definitions and integrating the hierarchical sampling strategy with online reinforcement learning are identified as promising research avenues.