RIVAL: Adversarial RL Framework Elevates Conversational Subtitle Translation
RIVAL (Reinforcement Learning with Iterative and Adversarial Optimization) introduces an adversarial game between a reward model and a translation LLM, combining qualitative preference rewards with quantitative metrics like BLEU, to overcome distribution shift in RLHF and achieve superior performance on conversational subtitle and WMT translation tasks.
Overview
This paper proposes RIVAL (Reinforcement Learning with Iterative and Adversarial Optimization), an iterative adversarial reinforcement-learning framework for machine translation (MT). We observe that standard RLHF performs poorly on conversational subtitle translation due to distribution shift between the reward model (RM) and the translation LLM, which causes training to fail.
Adversarial game mechanism: the optimization of the RM and the LLM is modeled as a min-max game in which the RM learns to distinguish strong from weak translations while the LLM improves its weak translations to narrow the quality gap.
Dual-reward design: qualitative preference rewards aligned with semantic similarity are combined with quantitative preference rewards (e.g., BLEU) to improve the stability and generalization of iterative RL training.
Experiments show RIVAL significantly outperforms supervised fine‑tuning (SFT) and dedicated translation models (e.g., Tower‑7B‑v0.2) on both conversational subtitle and WMT datasets while maintaining cross‑language generalization.
Motivation
Large language models (LLMs) exhibit breakthrough capabilities across tasks, offering a new paradigm for MT. Most research relies on supervised fine‑tuning (SFT) with maximum likelihood, which suffers from exposure bias and error accumulation, especially for informal, slang‑rich subtitle data lacking high‑quality parallel corpora. Traditional evaluation metrics like BLEU fail in semantic‑alignment‑focused scenarios, prompting us to build a large conversational subtitle dataset and explore RLHF for this domain.
We found RLHF often produces “reward hacking”: the LLM adds extraneous phrases (e.g., “It's ok! It's great!”) not present in the source, violating semantic fidelity.
Method
3.1 RIVAL Framework: Adversarial Iterative Optimization
RIVAL reformulates two-stage RLHF training as an adversarial game between the RM and the LLM, inspired by GANs, and trains both under a single min-max objective. In this objective, r_Φ is the reward model (discriminator) that distinguishes strong from weak translations, π_θ is the translation model (generator) that approximates the strong-translation distribution P_strong, and π_ref is a reference model used to constrain the KL divergence and prevent excessive drift. By iteratively updating the LLM and using its current outputs to retrain the RM, the RM becomes an online model that adapts to the distribution shift.
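A minimal sketch of such an objective, assuming the GAN-style combination of a pairwise rank term with a KL-regularized policy (the exact formulation and weighting used in the paper may differ), is:

\[
\min_{\pi_\theta}\;\max_{r_\Phi}\;\;
\mathbb{E}_{x,\; y_s \sim P_{\text{strong}},\; y_w \sim \pi_\theta(\cdot\mid x)}
\big[\log \sigma\big(r_\Phi(x, y_s) - r_\Phi(x, y_w)\big)\big]
\;+\; \beta\, \mathbb{D}_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big),
\]

where σ is the logistic function and β weights the KL penalty. The RM maximizes the separation between strong and weak translations, while the LLM minimizes it by making its own outputs indistinguishable from strong ones.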
3.2 RM and LLM Optimization
When the RM is optimized with the LLM fixed, the objective reduces to a pairwise rank loss that maximizes the score gap between strong and weak translations.
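A standard form of this pairwise rank loss, as used in RLHF reward modeling (Ouyang et al., 2022), is given below; the paper's version may differ in details such as margins or weighting:

\[
\mathcal{L}_{\text{rank}}(\Phi) \;=\; -\,\mathbb{E}_{(x,\, y_s,\, y_w)}
\big[\log \sigma\big(r_\Phi(x, y_s) - r_\Phi(x, y_w)\big)\big],
\]

where y_s is the strong translation and y_w is the weak one sampled from the current LLM.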
During LLM optimization with the RM fixed, the goal is to maximize the reward score provided by the RM, using the GRPO algorithm.
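GRPO (Shao et al., 2024) replaces PPO's learned value critic with a group-relative baseline: for each source sentence, a group of G candidate translations is sampled, and each candidate's advantage is its reward standardized within the group. A commonly cited form of the clipped objective, written here at the sequence level as an illustration rather than the paper's exact formulation, is:

\[
A_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)},\qquad
\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}
\min\!\big(\rho_i A_i,\; \operatorname{clip}(\rho_i,\, 1-\varepsilon,\, 1+\varepsilon)\, A_i\big)\right]
- \beta\, \mathbb{D}_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big),
\]

where ρ_i is the importance ratio between the current and the old policy on candidate i.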
3.3 Incorporating Quantitative Preference Rewards
To stabilize training, we design a multi-head RM that predicts both a qualitative preference reward and quantitative rewards such as BLEU. The total RM loss combines the rank loss with an MAE loss on the BLEU prediction.
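Assuming a simple weighted sum of the two terms (the coefficient λ and the exact target are assumptions; the paper gives the precise combination), the total loss would take a form such as:

\[
\mathcal{L}_{\text{RM}}(\Phi) \;=\; \mathcal{L}_{\text{rank}}(\Phi)
\;+\; \lambda\, \mathbb{E}_{(x,\, y)}\Big[\big|\hat{b}_\Phi(x, y) - \mathrm{BLEU}(y, y^{\ast})\big|\Big],
\]

where \(\hat{b}_\Phi\) is the quantitative head's BLEU prediction and \(y^{\ast}\) is the reference translation. The quantitative term anchors the RM's scores to a verifiable metric, which counteracts drift across iterations.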
In each iteration, the overall algorithm alternates between retraining the RM on fresh strong and weak translation pairs and updating the LLM against the refreshed RM; a minimal sketch of this loop follows.
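This is a rough Python-style sketch of the iterative loop under the assumptions above; the callables (translate, train_reward_model, grpo_update) are hypothetical stand-ins, not the paper's actual implementation or API.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str, str]  # (source, strong_translation, weak_translation)

def rival_training(
    translate: Callable[[str], str],  # current LLM exposed as a translate function
    train_reward_model: Callable[[List[Pair], bool], Callable[[str, str], float]],
    grpo_update: Callable[[Callable[[str, str], float]], Callable[[str], str]],
    strong_corpus: List[Tuple[str, str]],  # (source, strong_translation) pairs
    num_iters: int = 2,
) -> Callable[[str], str]:
    """Adversarial iterative loop: alternate RM training and GRPO policy updates."""
    for it in range(num_iters):
        # 1) Adversarial RM step: the current LLM's outputs serve as weak
        #    translations, the corpus translations as strong ones.
        pairs: List[Pair] = [
            (src, y_strong, translate(src)) for src, y_strong in strong_corpus
        ]

        # Rank loss on strong-vs-weak pairs; the quantitative (BLEU) head is
        # enabled in later iterations, mirroring RIVAL-Iter2-Qual+Quant.
        reward_fn = train_reward_model(pairs, it >= 1)

        # 2) Policy step: GRPO update of the LLM against the refreshed RM
        #    (the KL penalty toward the reference model lives inside grpo_update).
        translate = grpo_update(reward_fn)
    return translate
```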
Experiments
We evaluate RIVAL on our self‑built conversational subtitle dataset and the standard WMT benchmark. For subtitles we use GPT‑4o multi‑dimensional scores (accuracy, completeness, coherence, style) and COMETKiwi; for WMT we report BLEU and COMETKiwi.
RIVAL‑Iter1 (qualitative reward only) achieves an average GPT‑4o score of 3.68 (+5.5% over baseline) and COMETKiwi 66.27. Adding quantitative BLEU reward (RIVAL‑Iter2‑Qual+Quant) improves BLEU and COMETKiwi on both English‑Chinese and Chinese‑English tasks, demonstrating the complementary effect of dual rewards.
In out‑of‑distribution tests (e.g., medical German‑Chinese translation), RIVAL‑Iter1 maintains higher COMETKiwi (53.42) than SFT (49.15), showing better robustness.
Conclusion
RIVAL addresses distribution shift in RLHF for conversational subtitle translation by framing RM and LLM optimization as a min‑max game and introducing a dual‑reward mechanism. Experiments on subtitle and WMT tasks demonstrate superior performance over baselines, SFT, and specialized models, as well as improved out‑of‑domain robustness. Future work will explore iteration limits and computational efficiency.
References
Ouyang L, Wu J, Jiang X, et al. Training language models to follow instructions with human feedback. NeurIPS 2022.
Luo W, Li H, Zhang Z, et al. SAMBO-RL: Shifts-aware model-based offline reinforcement learning. arXiv 2024.
Goodfellow I J, et al. Generative adversarial nets. NeurIPS 2014.
Shao Z, Wang P, Zhu Q, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv 2024.
Achiam J, et al. GPT‑4 technical report. arXiv 2023.
Rei R, Treviso M, Guerreiro N M, et al. COMETKiwi: IST-Unbabel 2022 submission for the quality estimation shared task. arXiv 2022.
Shoeybi M, et al. Megatron‑LM: Training multi‑billion parameter language models using model parallelism. arXiv 2019.
Sheng G, Zhang C, Ye Z, et al. HybridFlow: A flexible and efficient RLHF framework. Proceedings of the 20th European Conference on Computer Systems, 2025.
Guo D, Yang D, Zhang H, et al. DeepSeek‑R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv 2025.
Alves D M, Pombal J, Guerreiro N M, et al. Tower: An open multilingual large language model for translation‑related tasks. arXiv 2024.