How Reinforcement Learning Cuts Hallucinations in Large Language Models: Ant Insurance’s Proven Approach

Ant Insurance’s tech team leveraged reinforcement learning, focused data selection, and a multi‑dimensional reward system to dramatically reduce hallucinations in LLMs, achieving top‑rank performance on the HHEM leaderboard and robust improvements across instruction‑following and reasoning‑enhanced models.

AntTech

Recently, Ant Insurance’s technology team made significant progress in hallucination control, taking first place on the public HHEM leaderboard and markedly improving factual compliance in its question‑answering services.

Our approach reduces model hallucinations not by complex tricks but by returning to fundamentals: meticulous data selection and carefully crafted reward design.

Data Selection

We use the Grounded Generation task as the entry point. The data pipeline consists of three steps:

Pre‑generation and Scoring: Generate many candidate answers with a base model, then score them for factuality using a reward model.

Difficulty‑Stratified Filtering: Retain medium‑to‑high difficulty samples—those where the model is likely to err—while discarding overly easy (low learning value) and overly hard (noisy) examples.

Input‑Space Coverage: Perform bucketed sampling on the filtered data to ensure diverse input lengths, enhancing robustness in real‑world scenarios.
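The three steps above can be sketched in a few lines of Python. Everything here is illustrative: the `factuality` score, the [0.2, 0.8] difficulty band, and the 100‑character length buckets are assumed stand‑ins, not Ant Insurance’s production values.

```python
import random
from collections import defaultdict

def select_training_samples(candidates, low=0.2, high=0.8, per_bucket=2, seed=0):
    """Difficulty-stratified filtering plus length-bucketed sampling (sketch).

    `candidates` is a list of dicts with hypothetical keys:
    "prompt", "answer", "factuality" (a reward-model score in [0, 1]).
    """
    # Keep medium-to-high difficulty samples: scores outside [low, high]
    # are treated as too easy (high score) or too noisy (low score).
    kept = [c for c in candidates if low <= c["factuality"] <= high]

    # Bucket by prompt length (100-character buckets, an assumed granularity)
    # so that every input-length range stays represented.
    buckets = defaultdict(list)
    for c in kept:
        buckets[len(c["prompt"]) // 100].append(c)

    # Draw up to `per_bucket` samples from each length bucket.
    rng = random.Random(seed)
    selected = []
    for key in sorted(buckets):
        pool = buckets[key]
        selected.extend(rng.sample(pool, min(per_bucket, len(pool))))
    return selected
```

In a real pipeline the difficulty band would be set from how often the base model’s candidates fail on each prompt, not from a single fixed score threshold.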

Reward Design

We aim to guide the model to be "grounded" without stifling fluency.

Core Reward (LLM‑as‑Judge): Use an LLM as the primary judge, classifying hallucination types (e.g., fabricated facts, misattributions) for precise scoring.

Auxiliary Constraints (Prevent Reward Hacking):

Format reward – enforce output format compliance.

Language consistency reward – encourage answers in the same language as the question.

Stiffness and plagiarism penalty – penalize overly rigid, unnatural, or copied text.

Reward design is an iterative process; we continuously adjust weights and rules to balance hallucination suppression with the preservation of the model’s general abilities.
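One common way to combine a core judge reward with auxiliary constraints is a weighted sum, with the weights being exactly the knobs that get tuned iteratively. The sketch below assumes hypothetical signal names and weights, not the team’s actual configuration.

```python
def combined_reward(judge_score, format_ok, same_language, copy_ratio,
                    w_judge=1.0, w_format=0.1, w_lang=0.1, w_copy=0.2):
    """Weighted aggregation of the core reward and auxiliary constraints (sketch).

    judge_score: LLM-as-Judge factuality score in [0, 1].
    format_ok / same_language: boolean compliance checks.
    copy_ratio: assumed n-gram overlap with the source passage in [0, 1].
    """
    reward = w_judge * judge_score
    reward += w_format * (1.0 if format_ok else -1.0)    # format compliance
    reward += w_lang * (1.0 if same_language else -1.0)  # language consistency
    reward -= w_copy * max(0.0, copy_ratio - 0.5)        # penalize heavy copying
    return reward
```

Keeping the auxiliary weights small relative to the judge weight is one way to suppress reward hacking without letting the constraints dominate the factuality signal.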

Why Reinforcement Learning Works

RL prunes the massive decision space of a language model by rewarding factual pathways, effectively cutting off branches that lead to hallucinations. This yields two main effects:

Benefit: Strong reduction of hallucinated content.

Risk: Over‑pruning can harm creativity, instruction diversity, or fluency if the reward signal is mis‑specified.

The goal is to find the sweet spot where hallucinations are minimized while other core capabilities remain intact.
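As a concrete illustration of “rewarding factual pathways”: group‑relative advantage estimation (in the style of GRPO‑family methods, not necessarily the exact algorithm used here) samples several answers per prompt, then reinforces generation pathways that score above the group mean and suppresses those below it.

```python
def group_advantages(rewards):
    """Group-relative advantages for one prompt's sampled answers (sketch).

    Answers scoring above the group mean get a positive advantage (their
    pathways are reinforced); below-mean answers get a negative advantage
    (those branches are pruned away over training).
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # avoid division by zero when all rewards tie
    return [(r - mean) / std for r in rewards]
```

If the reward is mis‑specified, the same mechanism prunes good branches too, which is exactly the over‑pruning risk described above.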

Experimental Validation

We evaluated the method on different model families. Both the instruction‑following Qwen2.5‑72B and the reasoning‑enhanced Qwen3‑32B showed consistent and significant hallucination reductions after applying our post‑training strategy.

These results confirm that reinforcement‑learning‑based hallucination control is robust and ready for industrial deployment.


