RIVAL: Adversarial RL Framework Elevates Conversational Subtitle Translation

RIVAL (Reinforcement Learning with Iterative and Adversarial Optimization) introduces an adversarial game between a reward model and a translation LLM, combining qualitative preference rewards with quantitative metrics such as BLEU to overcome distribution shift in RLHF and achieve superior performance on conversational subtitle and WMT translation tasks.

Bilibili Tech

Overview

This paper proposes RIVAL (Reinforcement Learning with Iterative and Adversarial Optimization), an iterative adversarial reinforcement‑learning framework for machine translation (MT). We observe that standard RLHF performs poorly on conversational subtitle translation: distribution shift between the reward model (RM) and the translation LLM causes training to fail.

Adversarial game mechanism: model the joint optimization of the RM and the LLM as a min‑max game, in which the RM learns to distinguish strong from weak translations while the LLM improves its weak translations to narrow the quality gap.

Dual‑reward design: combine qualitative preference rewards aligned with semantic similarity and quantitative preference rewards (e.g., BLEU) to improve stability and generalization of iterative RL training.

Experiments show RIVAL significantly outperforms supervised fine‑tuning (SFT) and dedicated translation models (e.g., Tower‑7B‑v0.2) on both conversational subtitle and WMT datasets while maintaining cross‑language generalization.

Motivation

Large language models (LLMs) exhibit breakthrough capabilities across tasks, offering a new paradigm for MT. Most research relies on supervised fine‑tuning (SFT) with maximum likelihood, which suffers from exposure bias and error accumulation, especially for informal, slang‑rich subtitle data lacking high‑quality parallel corpora. Traditional evaluation metrics like BLEU fail in semantic‑alignment‑focused scenarios, prompting us to build a large conversational subtitle dataset and explore RLHF for this domain.

We found RLHF often produces “reward hacking”: the LLM adds extraneous phrases (e.g., “It's ok! It's great!”) not present in the source, violating semantic fidelity.

Method

3.1 RIVAL Framework: Adversarial Iterative Optimization

RIVAL reformulates the two‑stage RLHF pipeline as an adversarial game between the RM and the LLM, inspired by GANs. The min‑max objective is

    min_θ max_Φ  E_{x, y⁺∼P_strong, y⁻∼π_θ(·|x)} [ log σ( r_Φ(x, y⁺) − r_Φ(x, y⁻) ) ] + β · D_KL( π_θ ‖ π_ref ),

where r_Φ is the reward model (discriminator) that separates strong translations y⁺ from weak translations y⁻, π_θ is the translation model (generator) approximating the strong‑translation distribution P_strong, and π_ref is a reference model whose KL term keeps π_θ from drifting too far. By iteratively updating the LLM and retraining the RM on the LLM's current outputs, the RM becomes an online model that tracks the shifting output distribution.

3.2 RM and LLM Optimization

When optimizing the RM (with the LLM fixed), the objective simplifies to a rank loss that maximizes the score gap between strong and weak translations:

    L_RM(Φ) = − E_{x, y⁺, y⁻} [ log σ( r_Φ(x, y⁺) − r_Φ(x, y⁻) ) ]
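A minimal numeric sketch of this Bradley‑Terry‑style rank loss, in plain Python (the function name and scalar scores are illustrative, not the paper's code):

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def rank_loss(strong_scores: list[float], weak_scores: list[float]) -> float:
    """Mean negative log-likelihood that the RM ranks each strong
    translation above its paired weak translation."""
    pairs = list(zip(strong_scores, weak_scores))
    return -sum(math.log(sigmoid(s - w)) for s, w in pairs) / len(pairs)
```

Widening the score gap drives the loss toward zero, while an inverted ranking is penalized heavily; this is what pushes the RM to separate the two distributions.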

During LLM optimization (with the RM fixed), the goal is to maximize the reward score provided by the RM, using the GRPO algorithm: for each source x, a group of G translations {y_1, …, y_G} is sampled from the old policy, each reward r_i = r_Φ(x, y_i) is normalized within its group,

    Â_i = ( r_i − mean(r_1, …, r_G) ) / std(r_1, …, r_G),

and the policy is updated with the clipped surrogate objective plus a KL penalty toward π_ref.
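The group‑relative advantage at the heart of GRPO can be sketched as follows (names are illustrative; the clipped surrogate and KL penalty are omitted):

```python
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """Normalize each sampled translation's reward within its group of
    G rollouts for the same source sentence (GRPO's critic-free advantage)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]
```

Translations scored above the group mean receive positive advantages and are reinforced; below‑mean ones are suppressed, with no separate value network required.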

3.3 Incorporating Quantitative Preference Rewards

To stabilize training, we design a multi‑head RM that predicts both a qualitative preference reward and quantitative rewards such as BLEU. The total RM loss combines the rank loss with an MAE loss on the BLEU prediction:

    L_total(Φ) = L_RM(Φ) + λ · E_{x, y} [ | b_Φ(x, y) − BLEU(y) | ],

where b_Φ is the quantitative head's BLEU prediction and λ balances the two terms.
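A toy version of the combined loss, assuming one preference pair and one BLEU prediction per example (all names and the λ default are illustrative):

```python
import math

def multihead_rm_loss(pref_strong: float, pref_weak: float,
                      bleu_pred: float, bleu_true: float,
                      lam: float = 1.0) -> float:
    """Rank loss on the qualitative preference head plus an MAE term
    on the quantitative (BLEU-prediction) head."""
    rank = -math.log(1.0 / (1.0 + math.exp(-(pref_strong - pref_weak))))
    mae = abs(bleu_pred - bleu_true)
    return rank + lam * mae
```

The MAE term anchors the RM to a verifiable quantity, so preference scores cannot drift arbitrarily between iterations.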

The overall algorithm alternates the two updates: the current LLM generates weak translations, the RM is retrained to separate them from strong references, and the LLM is then optimized against the refreshed RM.
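Assuming the alternating schedule just described, the training loop can be sketched with stub components (everything here is an illustrative stand‑in, not the paper's implementation):

```python
class StubLLM:
    """Stand-in for the translation policy pi_theta."""
    def __init__(self):
        self.updates = 0

    def translate(self, source: str) -> str:
        return source.lower()  # placeholder "weak" translation

    def rl_update(self, sources, reward_fn):
        self.updates += 1      # placeholder for a GRPO policy step


class StubRM:
    """Stand-in for the (multi-head) reward model r_phi."""
    def __init__(self):
        self.fits = 0

    def fit(self, sources, strong_refs, weak_outputs):
        self.fits += 1         # placeholder for rank-loss training

    def score(self, source: str, hypothesis: str) -> float:
        return float(len(hypothesis))  # placeholder reward


def rival_training(llm, rm, sources, strong_refs, iterations=2):
    """Alternate RM and LLM updates so the RM stays on-policy."""
    for _ in range(iterations):
        # 1) Sample weak translations from the current policy.
        weak = [llm.translate(x) for x in sources]
        # 2) Retrain the RM to separate strong references from current outputs.
        rm.fit(sources, strong_refs, weak)
        # 3) Update the LLM against the refreshed RM.
        llm.rl_update(sources, reward_fn=rm.score)
    return llm, rm
```

Because the RM is refit on the policy's latest outputs each round, it never scores translations far outside the distribution it was trained on, which is the core defense against reward hacking.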

Experiments

We evaluate RIVAL on our self‑built conversational subtitle dataset and the standard WMT benchmark. For subtitles we use GPT‑4o multi‑dimensional scores (accuracy, completeness, coherence, style) and COMETKiwi; for WMT we report BLEU and COMETKiwi.

RIVAL‑Iter1 (qualitative reward only) achieves an average GPT‑4o score of 3.68 (+5.5% over baseline) and COMETKiwi 66.27. Adding quantitative BLEU reward (RIVAL‑Iter2‑Qual+Quant) improves BLEU and COMETKiwi on both English‑Chinese and Chinese‑English tasks, demonstrating the complementary effect of dual rewards.

In out‑of‑distribution tests (e.g., medical German‑Chinese translation), RIVAL‑Iter1 maintains higher COMETKiwi (53.42) than SFT (49.15), showing better robustness.

Conclusion

RIVAL addresses distribution shift in RLHF for conversational subtitle translation by framing RM and LLM optimization as a min‑max game and introducing a dual‑reward mechanism. Experiments on subtitle and WMT tasks demonstrate superior performance over baselines, SFT, and specialized models, as well as improved out‑of‑domain robustness. Future work will explore iteration limits and computational efficiency.

References

Ouyang L, Wu J, Jiang X, et al. Training language models to follow instructions with human feedback. NeurIPS 2022.

Luo W, Li H, Zhang Z, et al. SAMBO-RL: Shifts-aware model-based offline reinforcement learning. arXiv 2024.

Goodfellow I J, et al. Generative adversarial nets. NeurIPS 2014.

Shao Z, Wang P, Zhu Q, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv 2024.

Achiam J, et al. GPT‑4 technical report. arXiv 2023.

Rei R, Treviso M, Guerreiro N M, et al. CometKiwi: IST-Unbabel 2022 submission for the quality estimation shared task. arXiv 2022.

Shoeybi M, et al. Megatron‑LM: Training multi‑billion parameter language models using model parallelism. arXiv 2019.

Sheng G, Zhang C, Ye Z, et al. HybridFlow: A flexible and efficient RLHF framework. Proceedings of the 20th European Conference on Computer Systems, 2025.

Guo D, Yang D, Zhang H, et al. DeepSeek‑R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv 2025.

Alves D M, Pombal J, Guerreiro N M, et al. Tower: An open multilingual large language model for translation‑related tasks. arXiv 2024.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: LLM, reinforcement learning, adversarial training, machine translation, BLEU, reward modeling