Latent Action RL Shrinks Exploration Space for Multimodal Dialogue Fine‑Tuning
By learning a compact latent‑action space from paired image‑text data and large‑scale text corpora, the authors shrink the per‑step RL search space from a vocabulary of over 150 k tokens to a 128‑entry codebook, enabling more efficient fine‑tuning of multimodal conversational agents and delivering consistent gains across several RL algorithms.
Vision‑Language Models (VLMs) have become the backbone of multimodal conversational agents (MCAs), but fine‑tuning them with reinforcement learning (RL) faces an exponential action‑space problem: with vocabulary size |V| and maximum response length m, the sampling space grows as |V|^m.
To mitigate this, the paper proposes constructing a compact latent‑action space that compresses the per‑step search from |V| (e.g., Qwen2.5‑VL’s 152 k tokens) to a codebook of size |C| (e.g., 128), an orders‑of‑magnitude reduction at every generation step.
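To make the scale concrete, here is a quick back‑of‑the‑envelope calculation using the sizes quoted above (152 k tokens vs. a 128‑entry codebook); the response length m below is an illustrative value, not a number from the paper:

```python
import math

# Per-step action-space sizes quoted above.
vocab_size = 152_000    # |V|: token-level actions (Qwen2.5-VL vocabulary)
codebook_size = 128     # |C|: latent actions in the codebook

print(f"per-step reduction: {vocab_size / codebook_size:,.0f}x")  # ~1,188x

# Over a response of m steps the gap compounds: |V|^m / |C|^m = (|V|/|C|)^m.
# Compare in log10 to avoid overflow.
m = 100  # illustrative response length, not from the paper
log10_ratio = m * (math.log10(vocab_size) - math.log10(codebook_size))
print(f"sequence-level reduction: ~10^{log10_ratio:.0f}")  # ~10^307
```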
The latent space is built by jointly exploiting paired image‑text data and massive pure‑text corpora. A cross‑modal projector P maps text embeddings to joint image‑text embeddings, while an inverse projector P' maps back to text embeddings. Training proceeds in two steps: (1) initialize P and P' on paired data; (2) further train both on paired data plus pure‑text data using a novel cycle‑consistency loss.
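A minimal PyTorch sketch of this two‑step projector training is below. The MLP architecture and the squared‑error objectives are assumptions for illustration; the summary specifies only that paired data initializes P and P′ and that pure text enters through a cycle‑consistency loss, not the exact loss forms.

```python
import torch.nn as nn
import torch.nn.functional as F

class Projector(nn.Module):
    """Simple MLP stand-in for the projectors P and P'."""
    def __init__(self, dim: int = 1024, hidden: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, x):
        return self.net(x)

P = Projector()      # cross-modal projector: text -> joint image-text space
P_inv = Projector()  # inverse projector: joint image-text -> text space

def paired_loss(text_emb, joint_emb):
    # Step 1: on paired image-text data, P should land on the joint
    # embedding and P' should map it back to the text embedding.
    return (F.mse_loss(P(text_emb), joint_emb)
            + F.mse_loss(P_inv(joint_emb), text_emb))

def cycle_consistency_loss(text_emb):
    # Step 2: on pure text, only require that P' inverts P, which lets
    # massive unpaired corpora shape the latent space.
    return F.mse_loss(P_inv(P(text_emb)), text_emb)
```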
The model comprises four modules:

1. Policy Model – predicts the latent action for the current step from the current text and image.
2. Inverse Dynamics Model – infers the latent action from a future step (used only during training).
3. Language World Model – predicts the next token given the current observation and latent action.
4. Latent Action Codebook – a set of N learnable vectors representing the discrete latent actions.

Training is split into two stages. Stage 1 (Inverse Dynamics Learning) jointly trains the Inverse Dynamics Model, the Language World Model, and the codebook by inferring a discrete latent action from future observations and reconstructing the next token. Stage 2 (Policy Behavior Cloning) trains the Policy Model to output latent actions matching those inferred in Stage 1.
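The two‑stage pipeline can be sketched as follows. This is a hedged reconstruction: inv_dyn, world_model, and policy are placeholders for the paper’s modules, and the straight‑through vector quantization with a 0.25 commitment weight is the standard VQ‑VAE recipe assumed here, not a detail confirmed by the source.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentActionCodebook(nn.Module):
    """N learnable latent-action vectors with nearest-neighbor lookup."""
    def __init__(self, n_codes: int = 128, dim: int = 512):
        super().__init__()
        self.codes = nn.Embedding(n_codes, dim)

    def forward(self, z):
        dists = torch.cdist(z, self.codes.weight)   # (B, N) distances
        idx = dists.argmin(dim=-1)                  # nearest code per sample
        q = self.codes(idx)
        # Straight-through estimator so gradients reach the encoder;
        # the 0.25 commitment weight follows the usual VQ-VAE recipe.
        vq_loss = F.mse_loss(q, z.detach()) + 0.25 * F.mse_loss(z, q.detach())
        return z + (q - z).detach(), idx, vq_loss

def stage1_loss(inv_dyn, world_model, codebook, obs, future_obs, next_token):
    """Stage 1: infer a latent action from the future observation, quantize
    it, and ask the Language World Model to reconstruct the next token."""
    action, idx, vq_loss = codebook(inv_dyn(future_obs))
    logits = world_model(obs, action)               # (B, |V|) token logits
    return F.cross_entropy(logits, next_token) + vq_loss, idx

def stage2_loss(policy, obs, image, target_idx):
    """Stage 2: behavior-clone the Policy Model onto Stage-1 latent actions."""
    logits = policy(obs, image)                     # (B, N) over the codebook
    return F.cross_entropy(logits, target_idx)
```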
During downstream RL, the Language World Model is frozen, and the Policy Model is optimized to select latent actions from the codebook that maximize the task‑specific reward, effectively performing RL in the compressed latent space.
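A simplified policy‑gradient step in this latent space might look like the sketch below. It uses plain REINFORCE with a group‑mean baseline as a stand‑in for the GRPO‑family objectives actually used, compresses response generation into a single world_model call, and treats codebook as a plain index‑to‑vector embedding lookup; reward_fn and the one‑latent‑action‑per‑response simplification are illustrative assumptions.

```python
import torch

def latent_rl_step(policy, world_model, codebook, reward_fn, optimizer,
                   obs, image, group_size: int = 8):
    """One RL update in the |C|-way latent-action space.
    The Language World Model stays frozen; only the policy is trained."""
    world_model.requires_grad_(False)  # assumes an nn.Module world model

    logits = policy(obs, image)        # (B, |C|) logits over the codebook
    dist = torch.distributions.Categorical(logits=logits)

    # Sample a group of latent actions per prompt, decode each with the
    # frozen world model, and score the responses with the task reward.
    actions = dist.sample((group_size,))                     # (G, B)
    with torch.no_grad():
        rewards = torch.stack(
            [reward_fn(world_model(obs, codebook(a))) for a in actions]
        )                                                    # (G, B)

    # Group-relative advantage: reward minus the per-prompt group mean,
    # echoing the GRPO family without its clipping/KL machinery.
    advantage = rewards - rewards.mean(dim=0, keepdim=True)
    loss = -(advantage * dist.log_prob(actions)).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```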
Experiments evaluate the approach on two multimodal dialogue benchmarks—multimodal role‑playing dialogue and multimodal personalized dialogue—using Qwen2.5‑VL‑3B‑Instruct and Qwen2.5‑VL‑7B‑Instruct. The authors compare token‑based RL fine‑tuning with latent‑action RL across four RL algorithms (GRPO, Dr.GRPO, DAPO, BNPO). Latent‑action RL consistently outperforms its token‑based counterparts under all four algorithms, demonstrating the method’s plug‑and‑play nature. Ablations that remove (a) the cycle‑consistency loss, (b) the cross‑modal projector, or (c) the pure‑text data each cause noticeable performance drops, confirming that all three components are necessary. A rollout‑diversity analysis further shows that latent‑action RL yields significantly higher rollout diversity than token‑based RL, supporting more efficient exploration.
In summary, the paper introduces a compact latent‑action space for RL fine‑tuning of MCAs, leverages both paired and unpaired data via a cycle‑consistent cross‑modal projector, and demonstrates substantial performance gains and improved exploration across multiple RL algorithms and multimodal dialogue tasks.
