Beyond Token Entropy: ReLaX Uses Latent Dynamics to Rethink Exploration‑Exploitation in LLM RL
The paper introduces ReLaX, a framework that shifts the focus of exploration‑exploitation control from token‑level entropy to the latent‑space dynamics of large models. By quantifying those dynamics with Koopman operators and a Dynamic Spectral Divergence (DSD) metric, ReLaX guides policy optimization and achieves state‑of‑the‑art performance on both pure‑text and multimodal RL benchmarks.
Reinforcement learning for large models suffers from policy distribution collapse: as training proceeds, policy entropy drops, exploration decays, and performance plateaus. Token‑level entropy regularization only manipulates a compressed projection of the hidden states and therefore cannot fully restore exploration.
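To make the "compressed view" point concrete, here is a minimal sketch of what token‑level entropy actually measures: the Shannon entropy of the final softmax over next tokens, which sees nothing of the hidden‑state trajectory that produced the logits. The function and example logits below are illustrative, not from the paper.

```python
import math

def token_entropy(logits):
    """Shannon entropy (nats) of the softmax distribution over next tokens."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # numerically stable softmax
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

# A peaked (near-collapsed) policy has far lower entropy than a flat one,
# but both numbers summarize only the output distribution, not the
# latent computation that produced it.
collapsed = token_entropy([10.0, 0.0, 0.0, 0.0])  # near-deterministic
uniform = token_entropy([1.0, 1.0, 1.0, 1.0])     # maximal, log(4)
```

Entropy regularization pushes this scalar upward, but two very different latent trajectories can yield the same output entropy, which is the gap ReLaX targets.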
Latent‑Space Perspective
The hidden states of a transformer evolve continuously in a high‑dimensional latent space, carrying the true computation logic of inference. Exploration‑exploitation imbalance is therefore rooted in the dynamics of this latent space rather than in the token distribution.
ReLaX Framework
ReLaX (Reasoning with Latent eXploration) treats the hidden‑state evolution as a stochastic dynamical system. Random perturbations such as temperature, top‑p or top‑k are interpreted as ripples that push the latent trajectory away from its original path. Instead of directly increasing token diversity, ReLaX explicitly regulates the latent‑space dynamics during policy optimization.
Dynamic Spectral Divergence (DSD)
To quantify latent dynamics, ReLaX adopts the Koopman operator, which linearizes nonlinear evolution in a function space. A ResKoopNet MLP learns a Koopman dictionary that maps the final‑layer hidden states to a tractable linear space. In this space the Dynamic Spectral Divergence (DSD) metric is defined as the variance of spectral mode lengths along a trajectory. Higher DSD indicates richer heterogeneous dynamics and greater potential for diverse latent reasoning paths.
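The construction above can be sketched in a few lines. This is a simplified, DMD‑style stand‑in: the learned ResKoopNet dictionary is replaced by a fixed random feature map, the Koopman operator is fit by least squares over consecutive lifted states, and DSD is taken as the variance of the trajectory's magnitudes along the spectral modes. All shapes and the dictionary are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def dsd(hidden_states, dict_dim=32, seed=0):
    """Dynamic Spectral Divergence (sketch): variance of Koopman spectral
    mode magnitudes along one latent trajectory.

    hidden_states: (T, d) array of final-layer hidden states.
    A fixed random tanh feature map stands in for the learned
    ResKoopNet dictionary (illustrative only).
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((hidden_states.shape[1], dict_dim))
    phi = np.tanh(hidden_states @ W)            # lift to dictionary space, (T, k)
    X, Y = phi[:-1], phi[1:]                    # consecutive observable pairs
    K, *_ = np.linalg.lstsq(X, Y, rcond=None)   # finite Koopman approximation
    eigvals, eigvecs = np.linalg.eig(K)         # spectral decomposition
    modes = X @ eigvecs                         # trajectory in mode coordinates
    lengths = np.linalg.norm(modes, axis=0)     # per-mode magnitude over time
    return float(np.var(lengths))
```

Higher variance across mode lengths means the trajectory energizes modes unevenly, i.e. richer, more heterogeneous dynamics in the sense the paper describes.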
DSD‑Guided Strategy Optimization
DSD is incorporated into the GRPO algorithm through two mechanisms:
Advantage Shaping: a regularization term tied to positive advantage values, increasing latent flexibility only on trajectories that yield forward gain, thus avoiding semantic drift.
Adaptive KL Regularization: penalizes trajectories whose DSD exceeds a threshold to keep dynamics stable, while preserving exploration space for promising trajectories.
This yields a dynamic balance: training remains stable while latent computation explores richer paths.
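The two mechanisms can be sketched as simple per‑trajectory rules. The coefficients (`beta`, `threshold`, `base`, `scale`) and function names here are assumed for illustration; the paper's exact formulation inside GRPO may differ.

```python
def shape_advantage(advantage, dsd_value, beta=0.1):
    """Advantage shaping (sketch): grant a DSD bonus only to trajectories
    that already have positive advantage, so extra latent flexibility is
    rewarded only where there is forward gain. `beta` is an assumed
    coefficient, not taken from the paper."""
    if advantage > 0:
        return advantage + beta * dsd_value
    return advantage

def adaptive_kl_weight(dsd_value, threshold=1.0, base=0.01, scale=0.1):
    """Adaptive KL regularization (sketch): strengthen the KL penalty for
    trajectories whose DSD exceeds a threshold, stabilizing overly wild
    dynamics while leaving low-DSD trajectories room to explore."""
    if dsd_value > threshold:
        return base + scale * (dsd_value - threshold)
    return base
```

Under this scheme a high‑advantage, moderately divergent trajectory gets a larger effective advantage, while a trajectory whose latent dynamics run past the threshold is pulled back by a heavier KL weight.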
Experimental Validation
ReLaX was evaluated on pure‑text LLMs and multimodal vision‑language models (VLMs) at 3B and 7B scales, comparing against the baseline GRPO.
Training curves show that GRPO’s policy entropy drops quickly, leading to sub‑optimal convergence, whereas ReLaX maintains steady performance gains and stable entropy throughout training.
On the Qwen2.5‑VL‑Instruct family, ReLaX‑7B achieves a mean@1 of 53.2% across seven multimodal benchmarks (MathVista, MathVerse, MathVision, MMMU, MMStar, DynaMath, EMMA), surpassing same‑scale baselines. The 3B variant matches or exceeds several 7B models.
For pure‑text mathematical reasoning, ReLaX consistently outperforms token‑entropy methods on six benchmarks (Math500, Minerva, AMC22/23, AIME24/25, etc.). Extending to Llama‑3.2‑Instruct and Qwen‑3‑base confirms scalability across architectures.
Comparisons with two token‑level families—Entropy‑Reg (direct entropy reward) and KL‑Cov (covariance‑based entropy control)—show that Entropy‑Reg offers no gain and can cause semantic drift, while KL‑Cov improves math‑heavy tasks but lags behind ReLaX on vision‑heavy benchmarks (e.g., EMMA‑Physics, +7.7%). The authors attribute this gap to the limited influence of token‑level perturbations on latent cross‑modal computation.
Future Outlook
Steering latent‑space dynamics provides a principled way to balance exploration and exploitation and offers a new lens for understanding large‑model reasoning.
Paper: https://arxiv.org/abs/2512.07558
Open‑source weights: https://huggingface.co/collections/SteveZ25/relax-checkpoints
GitHub repository: https://github.com/ZhangShimin1/ReLaX
