Can Reinforcement Learning Spot Hallucinations in LLMs? Introducing RL4HS
Apple’s new paper presents RL4HS, a reinforcement‑learning framework that uses span‑level rewards and class‑aware policy optimization to detect hallucinated text spans in large language models, outperforming GPT‑5 and other baselines and offering more precise, auditable error identification.
Apple researchers recently released a paper proposing RL4HS (Reinforcement Learning for Hallucination Span detection), a reinforcement‑learning framework for identifying hallucinated text spans in the output of large language models (LLMs).
Motivation
LLMs often generate hallucinations: statements that are factually incorrect or unsupported. Traditional hallucination detection treats the problem as binary classification, but many applications need the exact erroneous spans pinpointed, which is a more complex, multi‑step decision process.
Method
RL4HS builds on Group Relative Policy Optimization (GRPO) and introduces span‑level rewards that encourage the model to reason explicitly about which parts of the text are hallucinated. It also adds Class‑Aware Policy Optimization (CAPO), which scales the advantage of non‑hallucination samples to mitigate reward imbalance between the hallucination and non‑hallucination classes.
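The two ideas above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the reward is taken to be a character-overlap F1 between predicted and gold spans, a sample counts as "non-hallucination" when it predicts no spans, and the names `span_f1`, `class_aware_advantages`, and the scaling factor `alpha=0.5` are hypothetical choices made for this example.

```python
from statistics import mean, pstdev

Span = tuple[int, int]  # half-open character offsets [start, end)


def span_chars(spans: list[Span]) -> set[int]:
    """Expand spans into the set of character positions they cover."""
    return {i for start, end in spans for i in range(start, end)}


def span_f1(pred: list[Span], gold: list[Span]) -> float:
    """Character-overlap F1 between predicted and gold spans (assumed reward)."""
    p, g = span_chars(pred), span_chars(gold)
    if not p and not g:
        # Assumption: correctly predicting "no hallucination" earns full reward.
        return 1.0
    overlap = len(p & g)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def class_aware_advantages(
    rewards: list[float],
    predicted_empty: list[bool],
    alpha: float = 0.5,  # illustrative value, not taken from the article
    eps: float = 1e-6,
) -> list[float]:
    """Group-relative advantages, with non-hallucination samples scaled by alpha."""
    mu, sigma = mean(rewards), pstdev(rewards)
    advantages = [(r - mu) / (sigma + eps) for r in rewards]
    return [a * alpha if empty else a for a, empty in zip(advantages, predicted_empty)]


# One group of sampled predictions for a single input with one gold span.
gold = [(10, 20)]
group = [[(10, 20)], [(10, 15)], [], []]
rewards = [span_f1(pred, gold) for pred in group]
advantages = class_aware_advantages(rewards, [not pred for pred in group])
```

In this sketch the two samples that predict no spans receive half of their normalized advantage, so the "predict nothing" behaviour contributes less to the policy update than span predictions do.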
Experimental Setup
Experiments were conducted on the RAGTruth benchmark, covering summarization, question answering, and data‑to‑text tasks. The base models were Qwen2.5‑7B‑Instruct and Qwen2.5‑14B‑Instruct, with comparisons against pre‑trained reasoning models (Qwen3‑8B, Qwen3‑14B, QwQ‑32B) and commercial models (GPT‑5, o3, GPT‑4o‑mini, GPT‑5‑mini).
RL4HS outperformed both pre‑trained reasoning models and traditional supervised fine‑tuning baselines.
When evaluated with multiple sampled outputs per input (K > 1), chain‑of‑thought (CoT) reasoning showed significant gains, which supports its usefulness for hallucination span detection.
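One plausible way to read the multi-sample evaluation is as a best-of-K score: sample K predictions per input and keep the best span F1 among them. The summary does not state how the K samples are aggregated, so the aggregation below, the function names `char_f1` and `best_of_k`, and the representation of predictions as sets of character positions are all assumptions made for illustration.

```python
def char_f1(pred: set[int], gold: set[int]) -> float:
    """F1 over the character positions flagged as hallucinated."""
    if not pred and not gold:
        return 1.0  # assumption: agreeing on "no hallucination" is a perfect score
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def best_of_k(samples: list[set[int]], gold: set[int], k: int) -> float:
    """Best F1 among the first k sampled predictions (hypothetical aggregation)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return max(char_f1(sample, gold) for sample in samples[:k])


# Three sampled predictions for an input whose gold span covers characters 10..19.
gold = set(range(10, 20))
samples = [set(), set(range(10, 15)), set(range(10, 20))]
scores = [best_of_k(samples, gold, k) for k in (1, 2, 3)]
```

Under this reading, a model whose samples are diverse and occasionally correct sees its score rise with K, which is one way sampling could reveal gains that a single greedy output hides.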
Results
Key findings include:
Pre‑trained instruction‑tuned models (Qwen2.5‑7B/14B‑Instruct) achieved F1 scores below 30, indicating that prompting alone is insufficient for precise span detection.
Pre‑trained reasoning models showed modest improvements (e.g., Qwen3‑14B reached an F1 of 35.8) but still lagged behind fine‑tuned baselines.
Supervised fine‑tuning raised F1 to 55.4 for 14B models.
RL4HS achieved an average F1 of 55.9 with the 7B model, and the 14B model scored 57.6, 54.8, and 62.6 on the three individual tasks, surpassing the strongest commercial baselines (GPT‑5, o3).
Analysis
The paper also provides qualitative analysis showing that RL4HS can correctly identify hallucinated claims that pre‑trained models miss, demonstrating systematic, auditable reasoning rather than superficial token‑level heuristics.
Overall, the span‑level reward and class‑balanced optimization in RL4HS represent a significant step toward more reliable and auditable LLM outputs.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
