Latent Action RL Shrinks Exploration Space for Multimodal Dialogue Fine‑Tuning

By learning a compact latent‑action space from paired image‑text and large‑scale text data, the authors reduce the RL search space from a vocabulary of over 150 k tokens to a 128‑entry codebook, enabling more efficient fine‑tuning of multimodal conversational agents and yielding consistent gains across several RL algorithms.

Machine Learning Algorithms & Natural Language Processing

Vision‑Language Models (VLMs) have become the backbone of multimodal conversational agents (MCAs), but fine‑tuning them with reinforcement learning (RL) faces an exponential action‑space problem: with a vocabulary of size |V| and a maximum response length m, the sampling space grows as |V|^m.

To mitigate this, the paper proposes constructing a compact latent‑action space that compresses the per‑step search from |V| (e.g., Qwen2.5‑VL’s 152 k tokens) to a codebook of size |C| (e.g., 128), achieving orders‑of‑magnitude reduction.
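The scale of this reduction is easy to make concrete. The sketch below compares the log-size of the sequence-level search space under both action spaces; the response length m = 100 is an illustrative assumption, not a figure from the paper.

```python
import math

def log10_search_space(per_step: int, max_len: int) -> float:
    """log10 of the number of length-max_len action sequences."""
    return max_len * math.log10(per_step)

# Illustrative numbers: a ~152k-token vocabulary (Qwen2.5-VL) vs. a
# 128-entry latent codebook; m = 100 is an assumed response length.
vocab, codebook, m = 152_000, 128, 100
token_space = log10_search_space(vocab, m)     # ~518 orders of magnitude
latent_space = log10_search_space(codebook, m)  # ~211 orders of magnitude
```

Even at this toy length, the latent space cuts the exponent of the search space by more than half, and the per-step branching factor drops by a factor of roughly 1,200.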

The latent space is built by jointly exploiting paired image‑text data and massive pure‑text corpora. A cross‑modal projector P maps text embeddings to joint image‑text embeddings, while an inverse projector P' maps back to text embeddings. Training proceeds in two steps: (1) initialize P and P' on paired data; (2) further train both on paired data plus pure‑text data using a novel cycle‑consistency loss.
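The key property of the cycle-consistency loss is that it supervises pure-text data without any paired image: only the round trip through the joint space is penalized. A minimal sketch, assuming an L2 round-trip loss and using toy elementwise maps as hypothetical stand-ins for the learned projectors P and P':

```python
# Hedged sketch: P and P' in the paper are learned networks; here they are
# toy elementwise "linear layers" (hypothetical weights) to show the loss shape.

def project(vec, weight):
    # Elementwise scaling as a stand-in for a learned linear projection.
    return [w * v for w, v in zip(weight, vec)]

def cycle_consistency_loss(text_emb, P, P_inv):
    """L2 distance between a text embedding and its round trip P'(P(x))."""
    joint = project(text_emb, P)     # text -> joint image-text space
    recon = project(joint, P_inv)    # joint -> back to text space
    return sum((a - b) ** 2 for a, b in zip(text_emb, recon))

# Pure-text data needs no paired image: only the round trip is supervised.
x = [0.5, -1.0, 2.0]
loss = cycle_consistency_loss(x, P=[2.0, 2.0, 2.0], P_inv=[0.5, 0.5, 0.5])
# A perfect inverse projector drives the loss to zero.
```

This is what lets massive unpaired text corpora contribute to shaping the latent space alongside the smaller paired image-text data.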

Model design comprises four modules: (1) Policy Model – predicts the latent action for the current step from the current text and image; (2) Inverse Dynamics Model – infers the latent action from a future step (used only for learning); (3) Language World Model – predicts the next token given the current observation and latent action; (4) Latent Action Codebook – a set of N vectors. Training is split into two stages. Stage 1 (Inverse Dynamics Learning) jointly trains the Inverse Dynamics Model, Language World Model, and the codebook by inferring a discrete latent action from future observations and reconstructing the next token. Stage 2 (Policy Behavior Cloning) trains the Policy Model to output latent actions that match those inferred in Stage 1.
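The discrete bottleneck in Stage 1 can be pictured as a nearest-neighbour lookup into the codebook, in the style of vector quantization (an assumption about the mechanism; the paper's exact quantizer is not detailed here):

```python
# Minimal sketch, assuming a VQ-style nearest-neighbour codebook lookup.

def nearest_code(feature, codebook):
    """Return the index of the codebook vector closest to `feature` (L2)."""
    def sq_dist(c):
        return sum((f - ci) ** 2 for f, ci in zip(feature, c))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))

# Toy codebook with |C| = 4 entries (the paper uses 128).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Stage 1: the Inverse Dynamics Model encodes a future observation into a
# feature, which is snapped to a discrete latent action index.
future_feature = [0.9, 0.1]
latent_action = nearest_code(future_feature, codebook)
```

In Stage 2, the Policy Model is then trained by behavior cloning to predict these indices from the current text and image alone, without access to future observations.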

During downstream RL, the Language World Model is frozen, and the Policy Model is optimized to select latent actions from the codebook that maximize the task‑specific reward, effectively performing RL in the compressed latent space.
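Abstracting away the frozen world model, the RL problem per step reduces to optimizing a distribution over |C| discrete latent actions. The toy sketch below uses a context-free softmax policy and a plain REINFORCE update (the paper actually uses GRPO-family algorithms conditioned on text and image; this is only a shape illustration):

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_step(logits, reward_fn, lr=0.5, rng=random):
    """One REINFORCE update over latent actions: sample, score, adjust."""
    probs = softmax(logits)
    a = rng.choices(range(len(logits)), weights=probs)[0]
    r = reward_fn(a)
    # grad of log pi(a) w.r.t. logits: one-hot(a) - probs
    return [l + lr * r * ((1.0 if i == a else 0.0) - p)
            for i, (l, p) in enumerate(zip(logits, probs))], a, r

# Toy reward: latent action 2 is best among |C| = 4 actions.
reward = lambda a: 1.0 if a == 2 else 0.0
logits = [0.0, 0.0, 0.0, 0.0]
random.seed(0)
for _ in range(200):
    logits, _, _ = reinforce_step(logits, reward)
# The policy concentrates its mass on the rewarded latent action.
```

With only 128 arms per step instead of 152 k, each rollout carries far more signal per sampled action, which is the efficiency argument behind the method.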

Experiments evaluate the approach on two multimodal dialogue benchmarks—multimodal role‑playing dialogue and multimodal personalized dialogue—using Qwen2.5‑VL‑3B‑Instruct and Qwen2.5‑VL‑7B‑Instruct. The authors compare token‑based RL fine‑tuning with latent‑action RL across four RL algorithms (GRPO, Dr.GRPO, DAPO, BNPO). Results show that latent‑action RL consistently outperforms strong baselines, demonstrating the method’s plug‑and‑play nature. Ablation studies removing (a) the cycle‑consistency loss, (b) the cross‑modal projector, or (c) the pure‑text data each cause noticeable performance drops, confirming the necessity of these components. A rollout‑diversity analysis further reveals that latent‑action RL yields significantly higher diversity than token‑based RL, supporting more efficient exploration.
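Rollout diversity of the kind measured in that analysis is commonly quantified with a distinct-n metric (the ratio of unique to total n-grams across rollouts); the paper's exact metric is not specified here, so this is a representative sketch:

```python
# Sketch of a rollout-diversity measure (assumption: distinct-n, a common
# proxy; identical rollouts score low, varied rollouts score high).

def distinct_n(rollouts, n=2):
    """Ratio of unique n-grams to total n-grams across all rollouts."""
    grams = []
    for tokens in rollouts:
        grams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(grams)) / len(grams) if grams else 0.0

low = distinct_n([["a", "b", "c"]] * 4)           # all rollouts identical
high = distinct_n([["a", "b", "c"], ["c", "b", "a"],
                   ["b", "a", "c"], ["a", "c", "b"]])  # varied rollouts
```

Under any such metric, higher values indicate the policy is exploring a wider range of responses per prompt, which is the behavior the authors attribute to latent-action RL.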

In summary, the paper introduces a compact latent‑action space for RL fine‑tuning of MCAs, leverages both paired and unpaired data via a cycle‑consistent cross‑modal projector, and demonstrates substantial performance gains and improved exploration across multiple RL algorithms and multimodal dialogue tasks.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: multimodal, Reinforcement Learning, Vision-Language Models, dialogue agents, latent actions
Written by

Machine Learning Algorithms & Natural Language Processing

Focused on frontier AI technologies, empowering AI researchers' progress.
