Can LLMs Boost Reasoning Alone? Introducing SePT’s Simple Online Self‑Training
SePT (Self‑evolving Post‑Training) shows that a large language model can improve its mathematical reasoning by roughly ten percentage points using a reward‑free online self‑training loop that decouples the generation (sampling) temperature from the training temperature used in standard SFT, matching or surpassing RL‑based methods without harming general‑domain performance.
SePT Overview
SePT (Self‑evolving Post‑Training) is a reward‑free self‑training framework for large language models that improves reasoning ability by iteratively generating its own training data and fine‑tuning on it.
Online self‑training loop
1. Sample a question from the problem pool and generate an answer at sampling temperature τ_s.
2. Treat the resulting (question, answer) pairs as a supervised fine‑tuning (SFT) dataset and update the model with the negative log‑likelihood loss.
3. Use the updated model to generate the next round of data, and repeat (a minimal code sketch follows below).
Only standard SFT is used; no reward model, verifier, or teacher signal is introduced.
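The loop maps almost directly onto a few lines of code. Below is a minimal sketch using Hugging Face transformers; the learning rate, the τ_s value, the max generation length, and the absence of prompt‑token masking are illustrative choices, not the authors' configuration (their release builds on verl).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-Math-7B"  # one of the two models evaluated here
TAU_S = 0.4                     # low generation temperature; illustrative value

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-6)  # illustrative LR

def self_train_round(questions: list[str]) -> None:
    """One online round: generate with the current model, then SFT on the outputs."""
    # 1) Generate: sample each prompt once, at temperature TAU_S.
    model.eval()
    pairs = []
    with torch.no_grad():
        for q in questions:
            enc = tok(q, return_tensors="pt").to(model.device)
            out = model.generate(
                **enc, do_sample=True, temperature=TAU_S, max_new_tokens=512
            )
            ans = tok.decode(out[0, enc.input_ids.shape[1]:], skip_special_tokens=True)
            pairs.append((q, ans))

    # 2) SFT: standard NLL (training temperature 1) on the generated pairs.
    model.train()
    for q, a in pairs:
        ids = tok(q + a, return_tensors="pt").input_ids.to(model.device)
        loss = model(input_ids=ids, labels=ids).loss  # sketch: no prompt masking
        loss.backward()
        opt.step()
        opt.zero_grad()

# 3) The caller repeats rounds, so the updated model produces the next batch:
# for _ in range(num_rounds): self_train_round(sample_from_pool())
```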
Mathematical reasoning results
Experiments on six math benchmark suites using Qwen2.5‑Math‑7B and DeepSeek‑Math‑7B‑Instruct compare SePT against a strong baseline that sweeps sampling temperature without any post‑training. SePT raises Pass@1, Pass@8, Pass@32, and average (AVG) scores by roughly ten points. Compared with the RL‑based method GRPO (RLVR), SePT attains comparable performance: on the OpenThoughts‑Math (OTM) dataset with Qwen2.5‑Math‑7B, AVG is 55.2 vs. 56.6 for GRPO, and Pass@1 is 40.8 vs. 39.5.
Ablation and temperature decoupling
SePT Offline (training on a fixed, pre‑generated dataset) drops AVG to 45.5, versus 55.0 for the online variant, confirming the importance of regenerating data with the current model. Coupling the training temperature to the generation temperature (both set to τ_s) yields Pass@1 19.3 and AVG 44.6; decoupling them (low‑temperature generation, training temperature fixed at 1) improves this to Pass@1 39.5 and AVG 55.0.
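For concreteness, here is how a "training temperature" would enter the loss in this ablation: it divides the logits before the softmax, so τ_train = 1 recovers the ordinary NLL in the sketch above, while τ_train = τ_s gives the coupled setting. This is the standard tempered cross‑entropy formulation, offered as a sketch rather than the paper's exact code.

```python
import torch
import torch.nn.functional as F

def temperature_nll(
    logits: torch.Tensor, targets: torch.Tensor, tau_train: float
) -> torch.Tensor:
    """NLL of `targets` under softmax(logits / tau_train).

    logits:  (seq_len, vocab_size) next-token logits
    targets: (seq_len,) token ids
    """
    return F.cross_entropy(logits / tau_train, targets)

# Decoupled (SePT): generate at low tau_s, train at temperature 1.
#   loss = temperature_nll(logits, targets, tau_train=1.0)
# Coupled ablation: train at the same low temperature used for sampling.
#   loss = temperature_nll(logits, targets, tau_train=tau_s)
```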
Theoretical justification
Theorem 1 shows that, for any token pair, the pairwise logit margin after one SFT step is amplified by a factor of 1/τ_s when data is generated at temperature τ_s and the training temperature is fixed at 1. Consequently, low‑temperature sampling preserves the token ordering while widening the preference margins learned in pre‑training, which the paper links to better reasoning performance.
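A quick sanity check of the 1/τ_s scaling (a back‑of‑the‑envelope argument under idealized assumptions, not the paper's proof): sampling at temperature τ_s amounts to drawing from the tempered distribution softmax(z/τ_s), and the expected NLL at training temperature 1 is minimized when the new logits match those tempered targets.

```latex
% Idealized one-step argument, not the paper's Theorem 1.
% Targets are drawn from the tempered distribution q = softmax(z/\tau_s);
% minimizing the expected NLL at training temperature 1 over new logits z'
% is a cross-entropy whose optimum matches q exactly:
\begin{align*}
  z'^{\star} &= \arg\min_{z'} \;
    \mathbb{E}_{i \sim q}\bigl[-\log \operatorname{softmax}(z')_i\bigr]
    = \frac{z}{\tau_s} + c \quad (c \in \mathbb{R}), \\
  z'^{\star}_i - z'^{\star}_j &= \frac{z_i - z_j}{\tau_s},
\end{align*}
% so every pairwise margin is multiplied by 1/\tau_s (> 1 for \tau_s < 1),
% while the ordering of tokens is unchanged.
```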
Impact on general‑domain ability
Evaluations on IFEval, BBH, GPQA, MuSR, and MMLU‑Pro show negligible degradation: SePT scores 23.6/47.3/30.6/41.5/32.2 versus the base model's 23.4/47.5/29.9/41.4/32.1, with slight improvements on four of the five tasks.
Implementation details
Code is released at https://github.com/ElementQi/SePT. The implementation builds on ByteDance's open‑source verl framework, but the training loop is framework‑agnostic and lightweight: each prompt is sampled only once per generation round.