Can Reasoning Models Keep Improving? TEMPO Uses EM to Stop Reward Drift
The paper introduces TEMPO, a test‑time training framework inspired by the Expectation‑Maximization (EM) algorithm. TEMPO alternates Critic calibration (E‑step) with policy optimization (M‑step) to prevent reward‑signal drift. On Qwen3 and OLMO3 models, it continues to improve reasoning performance and maintain output diversity well past the point where existing test‑time training methods saturate.
Background
Large language models (LLMs) have fixed parameters after pre‑training, so during inference they can only apply the learned policy and cannot adapt from new test‑time data.
Test‑time Training and Reward Drift
Test‑time training (TTT) aims to overcome this limitation by updating the model on unlabeled test inputs. Existing TTT methods such as TTRL, EMPO and Intuitor initially improve performance but soon plateau and suffer a drop in output diversity because their reward signals are derived from the model’s own output distribution (majority vote or semantic consistency). This creates a feedback loop that biases the model toward its current dominant reasoning path, causing reward‑signal drift.
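To see where the feedback loop comes from, here is a minimal sketch of a majority‑vote pseudo‑reward in the style of TTRL. This is an illustrative simplification, not the paper's or TTRL's actual implementation:

```python
from collections import Counter

def majority_vote_rewards(answers: list[str]) -> list[float]:
    """Self-derived pseudo-reward: sample several answers to the same
    question and reward agreement with the most frequent one.
    Illustrative simplification, not the paper's implementation."""
    majority, _ = Counter(answers).most_common(1)[0]
    # Every sample matching the current majority gets reward 1, others 0.
    # Note the feedback loop: whichever reasoning path already dominates
    # gets reinforced, so the majority grows and diversity shrinks.
    return [1.0 if a == majority else 0.0 for a in answers]
```

Because the reward is computed from the policy's own samples, reinforcing the majority makes the majority larger on the next round, which is exactly the drift the paper describes.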
EM Perspective on Reward Drift
From an Expectation‑Maximization (EM) viewpoint, current TTT methods omit the E‑step, so the auxiliary distribution diverges from the true posterior. Consequently the evidence lower bound (ELBO) becomes loose and optimization drifts away from the real objective.
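Concretely (in our notation, not the paper's: x is a test question, z a sampled reasoning trace, and O the event that the final answer is correct), the standard EM identity reads:

```latex
\log p_\theta(O \mid x)
  \;=\; \underbrace{\mathbb{E}_{q(z \mid x)}\!\left[\log \frac{p_\theta(O, z \mid x)}{q(z \mid x)}\right]}_{\text{ELBO}(q,\,\theta)}
  \;+\; \underbrace{D_{\mathrm{KL}}\!\left(q(z \mid x) \,\middle\|\, p_\theta(z \mid x, O)\right)}_{\text{bound gap}}
```

The E‑step tightens the bound by moving the auxiliary distribution q toward the true posterior; with no E‑step the KL gap can grow freely, so increasing the ELBO in the M‑step no longer tracks the true objective.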
TEMPO Framework
TEMPO (Test‑time Expectation‑Maximization Policy Optimization) alternates two steps:
E‑step (Critic calibration): With the policy fixed, a Critic model trained on labeled data estimates the posterior probability that each generated answer is correct, anchoring the reward to true correctness.
M‑step (Policy optimization): With the auxiliary distribution fixed, the policy is updated to maximize the ELBO, using the Critic’s scores as external rewards and token‑wise baseline predictions to compute the advantage function (a code sketch of this computation follows the list).
The same Critic is reused for both steps and is periodically re‑trained on fresh labeled data to prevent drift.
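A minimal sketch of the M‑step advantage computation, assuming the Critic emits one correctness probability per answer and one baseline value per response token (the tensor shapes and names are our illustration, not the paper's code):

```python
import torch

def tempo_advantages(
    critic_scores: torch.Tensor,    # (batch,) P(answer correct) from the Critic
    token_baselines: torch.Tensor,  # (batch, seq_len) token-wise baseline values
    response_mask: torch.Tensor,    # (batch, seq_len) 1 on response tokens, 0 on padding
) -> torch.Tensor:
    """Advantage sketch: the Critic score serves as the external
    sequence-level reward; a token-wise baseline is subtracted for
    variance reduction. Illustrative simplification."""
    # Broadcast the sequence-level reward to every response token.
    rewards = critic_scores.unsqueeze(-1).expand_as(token_baselines)
    # Advantage = reward - baseline, masked to response tokens only.
    return (rewards - token_baselines) * response_mask
```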
Algorithm Details
In each iteration, TEMPO first samples unlabeled test questions, updates the Critic (E‑step), then samples test questions again and updates the policy (M‑step). The optimization objective combines (1) the expected log‑likelihood of correct answers and (2) a KL term that pushes the auxiliary distribution toward the true posterior.
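As one reading of that schedule, the per‑iteration control flow might look like the sketch below. The `Policy`/`Critic` interfaces and all method names are our invention, since the summary does not specify the update rules at this level of detail:

```python
import random
from typing import Protocol, Sequence

class Policy(Protocol):
    def generate(self, questions: Sequence[str]) -> list[str]: ...
    def ppo_update(self, questions: Sequence[str], answers: Sequence[str],
                   rewards: Sequence[float]) -> None: ...

class Critic(Protocol):
    def calibrate(self, questions: Sequence[str], answers: Sequence[str]) -> None: ...
    def score(self, questions: Sequence[str], answers: Sequence[str]) -> list[float]: ...
    def refit(self, labeled: Sequence[tuple[str, str]]) -> None: ...

def tempo_iteration(policy: Policy, critic: Critic,
                    test_pool: list[str], labeled_pool: list[tuple[str, str]],
                    step: int, refresh_every: int, batch_size: int = 32) -> None:
    # E-step: policy fixed; calibrate the Critic on fresh policy samples
    # so its correctness estimates track the current output distribution.
    batch = random.sample(test_pool, batch_size)
    critic.calibrate(batch, policy.generate(batch))

    # Periodic anchoring: re-train the Critic on fresh labeled data so the
    # reward stays tied to true correctness instead of drifting with the policy.
    if step % refresh_every == 0:
        critic.refit(labeled_pool)

    # M-step: Critic fixed; update the policy with Critic scores as rewards.
    batch = random.sample(test_pool, batch_size)
    answers = policy.generate(batch)
    policy.ppo_update(batch, answers, critic.score(batch, answers))
```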
Experiments
Mathematical reasoning experiments use Qwen3‑8B, Qwen3‑14B and OLMO3‑7B, each initialized with standard RLVR (PPO) on DAPO‑Math‑17K; test‑time training is then performed on the AIME 2024, AIME 2025 and Beyond AIME test sets. General‑reasoning experiments initialize the models on Dolci‑RL‑Zero‑General and evaluate on BIG‑Bench Hard, AGIEval and ZebraLogic. Metrics are avg@16 and pass@k for the math tasks and pass@1 for the general tasks. Baselines are Zero‑RL (PPO), TTRL and EMPO.
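The summary does not say which pass@k estimator is used; a common choice for such benchmarks is the unbiased estimator of Chen et al. (2021), shown here for reference:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k (Chen et al., 2021): the probability that at least
    one of k completions, drawn without replacement from n generated
    completions of which c are correct, is itself correct."""
    if n - c < k:
        return 1.0  # every size-k draw must include a correct completion
    return 1.0 - comb(n - c, k) / comb(n, k)

# avg@16 is mean accuracy over 16 samples per question, i.e. c/16,
# which coincides with this estimator at k = 1 with n = 16.
```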
Main Results
TEMPO consistently lifts performance beyond the saturation point of baselines. Qwen3‑14B’s accuracy on AIME 2024 rises from 42.3% to 65.8% (+23.5 pp), and OLMO3‑7B improves from 33.0% to 51.1% (+18.1 pp). Pass@k remains high: on Qwen3‑14B, pass@8 increases from 56.7% to 73.3%; on OLMO3‑7B, pass@8 goes from 45.8% to 60.0%. All three models keep improving for up to 350 test‑time steps, while TTRL and EMPO plateau early.
Ablation Studies
Continuing PPO training on the same labeled data yields almost no gain, confirming that new test questions are needed for further improvement. Removing the periodic E‑step (i.e., fixing the Critic after a single pre‑training) matches TEMPO’s early gains but soon plateaus, demonstrating that regular Critic recalibration is essential for sustained progress.
Conclusion
TEMPO shows that reasoning models can keep getting stronger during inference when the reward signal is periodically anchored to true correctness via an EM‑style E‑step. The method outperforms prior TTT approaches, maintains output diversity, and opens future research directions such as reducing dependence on labeled data, extending to agentic tasks, and providing theoretical convergence guarantees.