Apr 28, 2026 · Artificial Intelligence
Can Reasoning Models Keep Improving? TEMPO Uses EM to Stop Reward Drift
The paper introduces TEMPO, a test-time training (TTT) framework inspired by the Expectation-Maximization algorithm: it alternates policy optimization (M-step) with critic calibration (E-step) to prevent reward-signal drift. On Qwen3 and OLMO3 models, TEMPO continues to improve reasoning performance and preserves output diversity beyond the saturation point of existing TTT methods.
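
To make the alternation concrete, here is a minimal toy sketch of an EM-style test-time training loop with a deliberately drifting reward. Every name in it (`ToyPolicy`, `ToyCritic`, `em_test_time_training`) is a hypothetical stand-in for illustration, not TEMPO's actual implementation; it shows only the loop structure the summary describes.

```python
# Toy EM-style test-time training loop (illustrative only, not TEMPO's API).
# M-step: optimize the policy against the critic's reward.
# E-step: re-calibrate the critic on an anchor with a known reward,
#         undoing whatever drift accumulated during the M-step.

import random


class ToyCritic:
    """Reward model whose scale drifts as training proceeds."""

    def __init__(self):
        self.scale = 1.0

    def score(self, answer, target):
        return self.scale * -abs(answer - target)

    def calibrate(self, anchor_answer, anchor_target, expected_reward):
        # E-step (hypothetical): re-fit the scale so a known anchor pair
        # receives its known reward, removing accumulated drift.
        raw = -abs(anchor_answer - anchor_target)
        if raw != 0:
            self.scale = expected_reward / raw


class ToyPolicy:
    """Gaussian 'policy' over scalar answers; the mean is its only parameter."""

    def __init__(self, mean=0.0):
        self.mean = mean

    def sample(self, n=16):
        return [random.gauss(self.mean, 1.0) for _ in range(n)]

    def improve(self, samples, rewards, lr=0.5):
        # M-step (hypothetical): move toward the best-scoring sample,
        # a crude stand-in for policy optimization against the critic.
        best = max(zip(rewards, samples))[1]
        self.mean += lr * (best - self.mean)


def em_test_time_training(target=3.0, steps=20):
    policy, critic = ToyPolicy(), ToyCritic()
    for _ in range(steps):
        # M-step: policy optimization under the current critic.
        samples = policy.sample()
        rewards = [critic.score(s, target) for s in samples]
        policy.improve(samples, rewards)

        # Simulate reward drift, then E-step: calibrate the critic.
        critic.scale *= 1.1  # rewards inflate as training proceeds
        critic.calibrate(anchor_answer=target + 1.0,
                         anchor_target=target,
                         expected_reward=-1.0)
    return policy.mean


if __name__ == "__main__":
    random.seed(0)
    print(f"policy mean after EM-style TTT: {em_test_time_training():.2f}")
```

Without the E-step, the inflating reward scale would distort every subsequent policy update; periodically re-anchoring the critic is the role the summary attributes to critic calibration.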
EM algorithm · large language models · reasoning
