How POPEN Boosts LVLM Reasoning Segmentation with Preference Optimization and Ensemble

The paper introduces POPEN, a new framework that uses preference‑based optimization and ensemble methods to reduce hallucinations and improve segmentation accuracy in large visual language models, achieving state‑of‑the‑art results on multiple benchmarks.

Tencent Advertising Technology

Overview

Large visual language models (LVLMs) often generate hallucinated text and imprecise segmentation masks when following complex instructions. POPEN addresses these issues by fine‑tuning LVLMs with human‑derived preference data and by aggregating multiple model outputs using a preference‑driven attention ensemble.

Preference Data Collection

Textual preference ( P_t ) : For each image‑instruction pair {I, x}, a base LVLM produces a response y containing <seg> tags. ChatGPT edits y to a corrected version y_c. The tuple {I, x, y, y_c, L_y, L_{y_c}} forms the textual preference set, where L_y and L_{y_c} are token‑level index lists.
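The collection step above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function names and the positional token-diff heuristic for computing the index lists L_y and L_{y_c} are assumptions.

```python
def diff_token_indices(tokens_a, tokens_b):
    """Indices where two token sequences disagree (simple positional diff)."""
    n = min(len(tokens_a), len(tokens_b))
    changed = [i for i in range(n) if tokens_a[i] != tokens_b[i]]
    return (changed + list(range(n, len(tokens_a))),
            changed + list(range(n, len(tokens_b))))

def build_textual_preference(image_id, instruction, response, corrected):
    """Assemble one element {I, x, y, y_c, L_y, L_yc} of the textual set P_t."""
    L_y, L_yc = diff_token_indices(response.split(), corrected.split())
    return {"I": image_id, "x": instruction, "y": response,
            "y_c": corrected, "L_y": L_y, "L_yc": L_yc}

# Hypothetical example: the edited token ("milk." -> "water.") is index 5.
sample = build_textual_preference(
    "img_001", "Segment the animal drinking in the scene.",
    "The <seg> cat is drinking milk.", "The <seg> cat is drinking water.")
```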

Segmentation embedding preference ( P_s ) : Random Gaussian noise is added to three rectangular regions of the input image, creating three perturbed images. The LVLM processes each perturbed image with the same instruction, yielding three segmentation embeddings and masks. These masks are ranked by segmentation quality (high, medium, low) to produce preference lists for curriculum learning.
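The region-level noise injection can be sketched as follows; one perturbed copy is produced per rectangular box. The box coordinates, noise scale, and function name are illustrative assumptions.

```python
import numpy as np

def perturb_regions(image, boxes, sigma=0.1, seed=0):
    """Create one perturbed copy of `image` per box, adding Gaussian noise
    only inside that rectangular region; pixels outside it are untouched."""
    rng = np.random.default_rng(seed)
    perturbed = []
    for (y0, y1, x0, x1) in boxes:
        img = image.copy()
        img[y0:y1, x0:x1] += rng.normal(0.0, sigma, size=img[y0:y1, x0:x1].shape)
        perturbed.append(img)
    return perturbed

# Three boxes -> three perturbed views of the same image.
views = perturb_regions(np.zeros((8, 8, 3)),
                        [(0, 4, 0, 4), (2, 6, 2, 6), (4, 8, 4, 8)])
```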

Curriculum Learning and Noise Injection

Fine‑tuning proceeds in two stages. First, high‑ and low‑quality preference samples are used to teach basic object localization. Then, medium‑quality samples focus on boundary refinement. Random noise injection generates diverse embeddings, enabling effective collection of P_s data.
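Assuming each sample carries masks ranked high/medium/low as described above, the two-stage pair selection might look like this sketch (the staging logic is an assumption consistent with the text, not the paper's implementation):

```python
def curriculum_pairs(samples, stage):
    """Stage 1 contrasts clearly separated high- vs low-quality masks
    (coarse localization); stage 2 adds the harder medium-quality pairs
    (boundary refinement)."""
    if stage == 1:
        return [(s["high"], s["low"]) for s in samples]
    return ([(s["high"], s["medium"]) for s in samples]
            + [(s["medium"], s["low"]) for s in samples])

ranked = [{"high": "mask_a", "medium": "mask_b", "low": "mask_c"}]
```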

Preference Optimization Loss

The total loss combines a textual DPO loss L_t and a segmentation DPO loss L_s:

L = L_t(y_high, y_low) + L_s(mask_ref, mask_ft)

L_t contrasts high-quality against low-quality textual responses, widening the likelihood margin in favor of the preferred response; L_s does the same for segmentation embeddings, contrasting masks produced by the reference model against those of the fine-tuned model.
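Both terms follow the standard DPO form. A minimal sketch of one such term, assuming per-sample log-probabilities for the preferred ("winner") and dispreferred ("loser") outputs under the policy and the frozen reference model; β = 10 matches the hyper-parameter listed later:

```python
import math

def dpo_loss(logp_w, logp_l, logp_w_ref, logp_l_ref, beta=10.0):
    """Standard DPO objective: -log sigmoid(beta * (policy margin - ref margin)).
    The margin is the log-probability gap between the preferred and
    dispreferred output; the loss shrinks as the policy widens that gap
    beyond what the reference model already gives."""
    margin = (logp_w - logp_l) - (logp_w_ref - logp_l_ref)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy matches the reference exactly, the margin is zero and the loss is log 2; improving the policy's preference margin drives the loss toward zero.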

Preference‑Based Ensemble

For each candidate LVLM output, a preference score is computed. These scores modulate the attention weights γ in a query‑key mechanism, Q(E) and K(E), so that more reliable outputs receive higher weight in the aggregated result.
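A minimal sketch of this aggregation over K candidate embeddings. The pooled-query design, the softmax normalizations, and the way γ multiplies the attention weights are assumptions consistent with the description above, not the paper's exact formulation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def preference_ensemble(E, pref_scores, W_q, W_k):
    """Combine K candidate embeddings E of shape (K, d). Query-key attention
    logits are modulated by softmax-normalized preference scores gamma, so
    candidates judged more reliable dominate the weighted combination."""
    gamma = softmax(np.asarray(pref_scores, dtype=float))
    q = E.mean(axis=0) @ W_q                       # pooled query, shape (d,)
    logits = (E @ W_k) @ q / np.sqrt(E.shape[1])   # one logit per candidate
    weights = gamma * softmax(logits)
    weights /= weights.sum()                       # renormalize to a convex combo
    return weights @ E

d = 4
E = np.stack([np.ones(d), 2 * np.ones(d), 3 * np.ones(d)])
out = preference_ensemble(E, [5.0, 0.0, 0.0], np.eye(d), np.eye(d))
```

Because the weights form a convex combination, the output always lies within the span of the candidate embeddings.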

Experimental Setup

Datasets: ADE20K, COCO‑Stuff, LVIS‑PACO, the RefCOCO series, LLaVA‑150k, MUSE.

Model: PixelLM backbone with pre‑trained LLaVA‑7B / LLaVA‑Llama2‑13B as LVLM and CLIP‑ViT‑L/14‑336 visual encoder.

Hyper‑parameters: 30 perturbed images per sample, preference‑optimization coefficient β = 10, ensemble uses K = 3 responses.

Metrics: Generalized IoU (gIoU), class‑wise IoU (cIoU), CHAIR (object hallucination), and ChatGPT‑based quality scores.
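The two IoU metrics differ only in where the averaging happens. A small sketch (standard definitions; the function name is my own):

```python
import numpy as np

def giou_ciou(preds, gts):
    """gIoU: mean of per-image IoUs (every image counts equally).
    cIoU: total intersection over total union accumulated across the
    dataset, so images with larger masks weigh more."""
    per_image, inter_sum, union_sum = [], 0, 0
    for p, g in zip(preds, gts):
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        per_image.append(inter / union if union else 1.0)
        inter_sum += inter
        union_sum += union
    return float(np.mean(per_image)), float(inter_sum / union_sum)

a = np.array([[1, 1], [0, 0]], dtype=bool)
b = np.array([[1, 0], [0, 0]], dtype=bool)
```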

Results

On the MUSE benchmark, POPEN outperforms LISA, PixelLM, and other baselines across all metrics, notably reducing hallucinations and achieving the highest segmentation accuracy. Similar improvements are observed on RefCOCO, RefCOCO+, and RefCOCOg datasets.

Ablation studies show that removing either the preference‑based optimization or the ensemble degrades both textual quality and segmentation precision, confirming that the two components and their combined loss are all necessary.

Conclusion

POPEN demonstrates that aligning LVLMs with human preferences through a dedicated optimization loss and a preference‑driven ensemble substantially improves reasoning‑guided segmentation, advancing reliable pixel‑level understanding in multimodal models.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
