Can LLMs Boost Reasoning Alone? Introducing SePT’s Simple Online Self‑Training
SePT (Self‑evolving Post‑Training) shows that a large language model can improve its mathematical reasoning by roughly ten percentage points using a reward‑free online self‑training loop that decouples the generation (sampling) temperature from the training temperature used in standard SFT, matching or surpassing RL‑based methods without harming general‑domain performance.
SePT Overview
SePT (Self‑evolving Post‑Training) is a reward‑free self‑training framework for large language models that improves reasoning ability by iteratively generating its own training data and fine‑tuning on it.
Online self‑training loop
1. Sample a question from the problem pool and generate an answer at sampling temperature τ_s.
2. Treat the resulting (question, answer) pairs as a supervised fine‑tuning (SFT) dataset and update the model with the negative log‑likelihood loss.
3. Use the updated model to generate the next round of data, and repeat (a minimal code sketch follows below).
Only standard SFT is used; no reward model, verifier, or teacher signal is introduced.
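The loop maps almost directly onto a few lines of code. Below is a minimal sketch using Hugging Face transformers; the learning rate, the τ_s value, the max generation length, and the absence of prompt‑token masking are illustrative choices, not the authors' configuration (their release builds on verl).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-Math-7B"  # one of the two models evaluated here
TAU_S = 0.4                     # low generation temperature; illustrative value

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-6)  # illustrative LR

def self_train_round(questions: list[str]) -> None:
    """One online round: generate with the current model, then SFT on the outputs."""
    # 1) Generate: sample each prompt once, at temperature TAU_S.
    model.eval()
    pairs = []
    with torch.no_grad():
        for q in questions:
            enc = tok(q, return_tensors="pt").to(model.device)
            out = model.generate(
                **enc, do_sample=True, temperature=TAU_S, max_new_tokens=512
            )
            ans = tok.decode(out[0, enc.input_ids.shape[1]:], skip_special_tokens=True)
            pairs.append((q, ans))

    # 2) SFT: standard NLL (training temperature 1) on the generated pairs.
    model.train()
    for q, a in pairs:
        ids = tok(q + a, return_tensors="pt").input_ids.to(model.device)
        loss = model(input_ids=ids, labels=ids).loss  # sketch: no prompt masking
        loss.backward()
        opt.step()
        opt.zero_grad()

# 3) The caller repeats rounds, so the updated model produces the next batch:
# for _ in range(num_rounds): self_train_round(sample_from_pool())
```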
Mathematical reasoning results
Experiments on six math benchmark suites using Qwen2.5‑Math‑7B and DeepSeek‑Math‑7B‑Instruct compare SePT against a strong baseline that sweeps sampling temperature without any post‑training. SePT raises Pass@1, Pass@8, Pass@32, and average (AVG) scores by roughly ten points. Compared with the RL‑based method GRPO (RLVR), SePT attains comparable performance: on the OpenThoughts‑Math (OTM) dataset with Qwen2.5‑Math‑7B, AVG is 55.2 vs. 56.6 for GRPO, and Pass@1 is 40.8 vs. 39.5.
Ablation and temperature decoupling
SePT Offline (training on a fixed, pre‑generated dataset) drops AVG to 45.5, versus 55.0 for the online variant, confirming the importance of regenerating data with the current model. Coupling the training temperature to the generation temperature (both set to τ_s) yields Pass@1 19.3 and AVG 44.6; decoupling them (low‑temperature generation, training temperature fixed at 1) improves this to Pass@1 39.5 and AVG 55.0.
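For concreteness, here is how a "training temperature" would enter the loss in this ablation: it divides the logits before the softmax, so τ_train = 1 recovers the ordinary NLL in the sketch above, while τ_train = τ_s gives the coupled setting. This is the standard tempered cross‑entropy formulation, offered as a sketch rather than the paper's exact code.

```python
import torch
import torch.nn.functional as F

def temperature_nll(
    logits: torch.Tensor, targets: torch.Tensor, tau_train: float
) -> torch.Tensor:
    """NLL of `targets` under softmax(logits / tau_train).

    logits:  (seq_len, vocab_size) next-token logits
    targets: (seq_len,) token ids
    """
    return F.cross_entropy(logits / tau_train, targets)

# Decoupled (SePT): generate at low tau_s, train at temperature 1.
#   loss = temperature_nll(logits, targets, tau_train=1.0)
# Coupled ablation: train at the same low temperature used for sampling.
#   loss = temperature_nll(logits, targets, tau_train=tau_s)
```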
Theoretical justification
Theorem 1 shows that, for any token pair, the pairwise logit margin after one SFT step is amplified by a factor of 1/τ_s when data is generated at temperature τ_s and the training temperature is fixed at 1. Consequently, low‑temperature sampling preserves the token ordering while widening the preference margins learned in pre‑training, which the paper links to better reasoning performance.
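A quick sanity check of the 1/τ_s scaling (a back‑of‑the‑envelope argument under idealized assumptions, not the paper's proof): sampling at temperature τ_s amounts to drawing from the tempered distribution softmax(z/τ_s), and the expected NLL at training temperature 1 is minimized when the new logits match those tempered targets.

```latex
% Idealized one-step argument, not the paper's Theorem 1.
% Targets are drawn from the tempered distribution q = softmax(z/\tau_s);
% minimizing the expected NLL at training temperature 1 over new logits z'
% is a cross-entropy whose optimum matches q exactly:
\begin{align*}
  z'^{\star} &= \arg\min_{z'} \;
    \mathbb{E}_{i \sim q}\bigl[-\log \operatorname{softmax}(z')_i\bigr]
    = \frac{z}{\tau_s} + c \quad (c \in \mathbb{R}), \\
  z'^{\star}_i - z'^{\star}_j &= \frac{z_i - z_j}{\tau_s},
\end{align*}
% so every pairwise margin is multiplied by 1/\tau_s (> 1 for \tau_s < 1),
% while the ordering of tokens is unchanged.
```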
Impact on general‑domain ability
Evaluations on IFEval, BBH, GPQA, MuSR, and MMLU‑Pro show negligible degradation: SePT scores 23.6/47.3/30.6/41.5/32.2 versus the base model's 23.4/47.5/29.9/41.4/32.1, with slight improvements on four of the five tasks.
Implementation details
Code is released at https://github.com/ElementQi/SePT. The implementation builds on ByteDance's open‑source verl framework, but the training loop is framework‑agnostic and lightweight: each prompt is sampled only once per generation round.