Apr 28, 2026 · Artificial Intelligence
Can Reasoning Models Keep Improving? TEMPO Uses EM to Stop Reward Drift
The paper introduces TEMPO, a test-time training (TTT) framework inspired by the Expectation-Maximization algorithm: it alternates policy optimization (M-step) with critic calibration (E-step) to prevent reward-signal drift. On Qwen3 and OLMO3 models, TEMPO continues to improve reasoning performance and preserves output diversity beyond the saturation point of existing TTT methods.
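
To make the alternation concrete, here is a minimal toy sketch of an EM-style test-time training loop with a deliberately drifting reward. Every name in it (`ToyPolicy`, `ToyCritic`, `em_test_time_training`) is a hypothetical stand-in for illustration, not TEMPO's actual implementation; it shows only the loop structure the summary describes.

```python
# Toy EM-style test-time training loop (illustrative only, not TEMPO's API).
# M-step: optimize the policy against the critic's reward.
# E-step: re-calibrate the critic on an anchor with a known reward,
#         undoing whatever drift accumulated during the M-step.

import random


class ToyCritic:
    """Reward model whose scale drifts as training proceeds."""

    def __init__(self):
        self.scale = 1.0

    def score(self, answer, target):
        return self.scale * -abs(answer - target)

    def calibrate(self, anchor_answer, anchor_target, expected_reward):
        # E-step (hypothetical): re-fit the scale so a known anchor pair
        # receives its known reward, removing accumulated drift.
        raw = -abs(anchor_answer - anchor_target)
        if raw != 0:
            self.scale = expected_reward / raw


class ToyPolicy:
    """Gaussian 'policy' over scalar answers; the mean is its only parameter."""

    def __init__(self, mean=0.0):
        self.mean = mean

    def sample(self, n=16):
        return [random.gauss(self.mean, 1.0) for _ in range(n)]

    def improve(self, samples, rewards, lr=0.5):
        # M-step (hypothetical): move toward the best-scoring sample,
        # a crude stand-in for policy optimization against the critic.
        best = max(zip(rewards, samples))[1]
        self.mean += lr * (best - self.mean)


def em_test_time_training(target=3.0, steps=20):
    policy, critic = ToyPolicy(), ToyCritic()
    for _ in range(steps):
        # M-step: policy optimization under the current critic.
        samples = policy.sample()
        rewards = [critic.score(s, target) for s in samples]
        policy.improve(samples, rewards)

        # Simulate reward drift, then E-step: calibrate the critic.
        critic.scale *= 1.1  # rewards inflate as training proceeds
        critic.calibrate(anchor_answer=target + 1.0,
                         anchor_target=target,
                         expected_reward=-1.0)
    return policy.mean


if __name__ == "__main__":
    random.seed(0)
    print(f"policy mean after EM-style TTT: {em_test_time_training():.2f}")
```

Without the E-step, the inflating reward scale would distort every subsequent policy update; periodically re-anchoring the critic is the role the summary attributes to critic calibration.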
EM algorithm · large language models · reasoning
