Enhancing Reinforcement Learning with Label-Sensitive Reward for Natural Language Understanding
This paper introduces RLLR, a label‑sensitive reward reinforcement learning method that improves natural language understanding tasks by aligning training objectives with label accuracy, and demonstrates its effectiveness across eight public NLU datasets and real‑world advertising feature evaluation, outperforming standard RLHF and SFT baselines.
ACL 2024 was held in Bangkok, and the Tencent Advertising Technology team had two papers accepted; this article describes one of them, titled "Enhancing Reinforcement Learning with Label‑Sensitive Reward for Natural Language Understanding".
While Reinforcement Learning from Human Feedback (RLHF) has achieved impressive results on natural language generation (NLG) tasks, its application to natural language understanding (NLU) remains under‑explored. Existing RLHF pipelines rank whole model responses, which predominantly yields rationale‑sensitive pairs in which the predicted label stays the same and only the explanation differs. The result is a mismatch between the training objective and the label accuracy that NLU tasks are actually measured on.
To address this, the authors propose RLLR (Reinforcement Learning with Label‑Sensitive Reward). They convert NLU tasks into a natural‑language format where each response contains a rationale and a label. By constructing label‑sensitive pairs (same question, different labels) and rationale‑sensitive pairs (same label, different rationales), they train a dedicated Reward Model that directly optimizes label correctness. A mixed approach, RLLR‑mixed, combines the original RLHF Reward Model with the new label‑sensitive Reward Model, weighting their outputs to produce a final reward.
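The two pair types and the mixed reward can be made concrete with a small sketch. This is illustrative code under assumed names (`build_pairs`, `mixed_reward`, the `label_correct` field, and the weight `alpha` are not from the paper's released implementation):

```python
def build_pairs(question, responses):
    """Construct preference pairs from sampled (rationale, label) responses.

    `responses` is a list of dicts with keys "rationale", "label", and
    "label_correct" (whether the label matches the gold answer).
    """
    label_pairs, rationale_pairs = [], []
    for a in responses:
        for b in responses:
            if a is b:
                continue
            if a["label_correct"] and not b["label_correct"]:
                # Label-sensitive pair: same question, different labels;
                # the response with the correct label is preferred.
                label_pairs.append((question, a, b))
            elif a["label"] == b["label"] and a["rationale"] != b["rationale"]:
                # Rationale-sensitive pair: same label, different rationales;
                # preference between them comes from human/RLHF-style ranking.
                rationale_pairs.append((question, a, b))
    return label_pairs, rationale_pairs

def mixed_reward(r_label, r_rlhf, alpha=0.5):
    # RLLR-mixed: weighted combination of the label-sensitive RM score and
    # the standard RLHF RM score. The weight alpha is a tunable
    # hyperparameter; 0.5 is an assumed default, not a value from the paper.
    return alpha * r_label + (1.0 - alpha) * r_rlhf
```

On a toy sentiment example, two responses with the correct label and one with the wrong label produce two label-sensitive pairs (each correct response against the incorrect one) and two rationale-sensitive pairs (the two correct responses against each other, in both orders).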
The method was evaluated on eight public NLU datasets (MovieReviews, AGNews, AppReviews, MRPC, QQP, MNLI, SST‑2, STS‑B) using five base LLMs (LLaMA‑2, ChatGLM‑3, Mistral, Baichuan‑2, BLOOM). Results show that RLLR improves average label accuracy by 1.54 % over a supervised‑fine‑tuning (SFT) baseline and by 0.69 % over standard RLHF. RLLR‑mixed further boosts both label accuracy and rationale quality.
Analysis reveals that merely adding rationales during SFT (SFT w. rat.) yields modest gains, whereas the inverse‑reinforcement‑learning style of RLLR more effectively reduces cumulative error and aligns training with the label‑accuracy goal. The RLLR Reward Model achieves higher accuracy on label‑sensitive pairs than the RLHF Reward Model, confirming its superior guidance for the policy model.
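The reward-model comparison above rests on the standard pairwise ranking objective used to train reward models in RLHF-style pipelines; in RLLR the "chosen" side of a label-sensitive pair is simply the response with the correct label. A minimal sketch, assuming the common Bradley-Terry formulation:

```python
import math

def pairwise_rm_loss(score_chosen, score_rejected):
    """Bradley-Terry ranking loss: -log(sigmoid(r_chosen - r_rejected)).

    The loss shrinks as the reward model's margin between the preferred
    and rejected response grows.
    """
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At zero margin the loss is log 2; widening the margin in the correct direction drives it toward zero, which is what lets the label-sensitive reward model separate correct-label from incorrect-label responses.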
Case studies illustrate that RLLR can correctly interpret nuanced sentiment in reviews where RLHF fails, and RLLR‑mixed can generate richer, markdown‑formatted rationales.
The approach was deployed in Tencent’s advertising feature‑evaluation pipeline, where semantic features (e.g., category, brand) are judged for correctness. On this real‑world task, RLLR consistently outperformed SFT and RLHF baselines, and RLLR‑mixed produced higher‑quality rationales in 95 % of evaluated samples.
In summary, the paper presents a label‑sensitive reinforcement learning framework that resolves the objective mismatch of RLHF for NLU, demonstrates consistent improvements across diverse datasets and models, and validates its practical impact in large‑scale advertising systems.
Tencent Advertising Technology