
MoLE: Decoding by Mixture of Layer Experts Alleviates Hallucination in Large Vision-Language Models

Researchers from Ant Insurance and Zhejiang University propose MoLE, a Mixture of Layer Experts decoding method that reduces hallucinations in large vision‑language models, demonstrating state‑of‑the‑art performance on LVLM benchmarks and enabling reliable end‑to‑end medical‑record‑to‑claim automation.

AntTech

Hallucination—where generated content diverges from visual inputs and instructions—is a critical issue for Large Vision‑Language Models (LVLMs), especially in high‑stakes applications such as intelligent insurance claim processing.

Traditional mitigation relies on contrastive decoding, using a weaker "amateur" model to filter erroneous outputs, but this approach is limited by the quality and availability of the auxiliary model.
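For context, contrastive decoding in its simplest form re-scores each candidate token by the gap between the strong model's and the amateur model's log-probabilities, masking tokens the strong model itself finds implausible. The following is a minimal pure-Python sketch of that idea with toy numbers; the function names, the `alpha` cutoff, and the vocabulary are illustrative, not any specific paper's implementation.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_scores(expert_logits, amateur_logits, alpha=0.1):
    """Score tokens by expert log-prob minus amateur log-prob
    (a simplified contrastive-decoding objective). Tokens whose
    expert probability falls below alpha * max prob are masked."""
    p_exp = softmax(expert_logits)
    p_ama = softmax(amateur_logits)
    cutoff = alpha * max(p_exp)
    return [
        math.log(pe) - math.log(pa) if pe >= cutoff else float("-inf")
        for pe, pa in zip(p_exp, p_ama)
    ]

# Toy 4-token vocabulary: the amateur over-prefers token 3,
# so the contrast demotes it and token 0 wins.
expert = [2.0, 0.5, 0.1, 1.8]
amateur = [0.5, 0.4, 0.2, 2.2]
scores = contrastive_scores(expert, amateur)
best = max(range(len(scores)), key=lambda i: scores[i])  # -> 0
```

The weakness the article points to is visible here: the output quality hinges entirely on how well `amateur` models the errors to be suppressed.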

To overcome these limitations, Ant Insurance and Zhejiang University investigated the root causes of LVLM hallucinations, identifying that they can arise from flawed reasoning, faulty injection of factual information, and gradual forgetting of the prompt as token generation progresses.

Inspired by the Mixture of Experts (MoE) paradigm, they introduced a training‑free decoding technique called Mixture of Layer Experts (MoLE). MoLE employs a heuristic gating mechanism to dynamically select multiple layers of an LVLM as expert layers, combining their logits to produce more robust and faithful outputs.

MoLE defines three key expert layers:

Final Expert: the last layer, responsible for the final prediction.

Second-Opinion Expert: a layer chosen from the upper layers whose logits diverge from the final layer's on critical tokens, offering an alternative viewpoint.

Prompt-Retention Expert: the layer with the highest attention scores on prompt tokens, ensuring the model retains the original instruction.
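The two selection heuristics above might look like the following pure-Python sketch. The divergence measure (KL), the candidate layer range, and the toy logits/attention numbers are all illustrative assumptions, not the paper's exact criteria.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def pick_second_opinion(layer_logits, final_logits, candidates):
    """Among candidate upper layers, pick the one whose next-token
    distribution diverges most from the final layer's (a hypothetical
    stand-in for the paper's divergence-on-critical-tokens test)."""
    p_final = softmax(final_logits)
    return max(candidates, key=lambda i: kl(softmax(layer_logits[i]), p_final))

def pick_prompt_retention(attn_to_prompt):
    """Pick the layer with the highest mean attention mass on prompt
    tokens; attn_to_prompt[layer] holds per-head scores (toy values)."""
    return max(range(len(attn_to_prompt)),
               key=lambda i: sum(attn_to_prompt[i]) / len(attn_to_prompt[i]))

# Toy example: 4 layers, vocabulary of 3 tokens; layer 3 is final.
layer_logits = [
    [0.1, 0.2, 0.3],
    [1.0, 0.1, 0.1],
    [0.2, 1.5, 0.1],
    [0.3, 0.2, 1.2],
]
second = pick_second_opinion(layer_logits, layer_logits[-1], range(1, 3))
retain = pick_prompt_retention([[0.2, 0.3], [0.6, 0.5], [0.4, 0.4], [0.3, 0.2]])
```

Because both heuristics read only quantities already computed during decoding (intermediate logits and attention maps), the gating adds no extra forward passes.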

The decoding process consists of four steps:

Select the Final Expert layer.

Select the Second‑Opinion Expert layer.

Select the Prompt‑Retention Expert layer.

Aggregate the logits from the three experts in a single forward pass to generate the final prediction, avoiding the computational overhead of multiple passes.
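Step 4 can be sketched as a weighted combination of the three experts' logits. The weights below are hypothetical placeholders; the paper's gating mechanism may weight the experts differently or adaptively.

```python
def mole_aggregate(final_logits, second_logits, retain_logits,
                   weights=(0.6, 0.2, 0.2)):
    """Weighted sum of the three expert layers' logits. All three
    come from intermediate activations of a single forward pass,
    so no additional passes are needed. Weights are assumptions."""
    w_f, w_s, w_r = weights
    return [w_f * f + w_s * s + w_r * r
            for f, s, r in zip(final_logits, second_logits, retain_logits)]

# Toy 3-token vocabulary: logits from the three selected layers.
final_l  = [0.3, 0.2, 1.2]
second_l = [0.2, 1.5, 0.1]
retain_l = [1.0, 0.1, 0.1]
combined = mole_aggregate(final_l, second_l, retain_l)
next_token = max(range(len(combined)), key=lambda i: combined[i])  # -> 2
```

Here the final layer's preference survives aggregation, but a strong disagreement from the other experts could flip the choice, which is the intended robustness effect.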

Experiments were conducted on three state‑of‑the‑art LVLMs—MiniGPT‑4, LLaVA‑1.5, and Shikra, all using Vicuna‑7B as the language decoder. MoLE was compared against existing baselines on two benchmark suites: POPE (object hallucination) and CHAIR (hallucination in long‑text generation).

Results show that MoLE consistently outperforms prior methods, achieving state‑of‑the‑art reductions in hallucination. Notably, on MiniGPT‑4 the CHAIR‑S metric improves by 21% over the DoLa baseline, highlighting MoLE's effectiveness in long‑text generation.

Beyond benchmark performance, MoLE has been deployed in Ant Insurance's intelligent claim workflow, enabling an end‑to‑end "medical‑record‑to‑claim" system that powers the "秒赔" (instant claim) service of the "安心赔" (worry‑free claims) pilot product, reducing information loss and error propagation in the claim pipeline.

Overall, MoLE demonstrates that a lightweight, layer‑wise expert mixture can substantially mitigate hallucinations in LVLMs, paving the way for more reliable multimodal AI applications such as automated insurance claim processing.

AI · Mixture of Experts · Vision-Language Models · Hallucination Mitigation · Insurance Automation
Written by

AntTech

Technology is the core driver of Ant's future creation.
