How POPEN Boosts LVLM Reasoning Segmentation with Preference Optimization and Ensemble
The paper introduces POPEN, a new framework that uses preference-based optimization and ensemble methods to reduce hallucinations and improve segmentation accuracy in large vision-language models (LVLMs), achieving state-of-the-art results on multiple benchmarks.
Overview
Large vision-language models (LVLMs) often generate hallucinated text and imprecise segmentation masks when following complex instructions. POPEN addresses these issues by fine-tuning LVLMs with human-derived preference data and by aggregating multiple model outputs using a preference-driven attention ensemble.
Preference Data Collection
Textual preference ( P_t ) : For each image‑instruction pair {I, x}, a base LVLM produces a response y containing <seg> tags. ChatGPT edits y to a corrected version y_c. The tuple {I, x, y, y_c, L_y, L_{y_c}} forms the textual preference set, where L_y and L_{y_c} are token‑level index lists.
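The construction of the token-level index lists L_y and L_{y_c} is not detailed above; one plausible reading is that they mark the token positions where y and its correction y_c disagree, which can be sketched with Python's difflib (the function name and exact diff strategy are illustrative assumptions, not taken from the paper):

```python
import difflib

def diff_token_indices(y_tokens, yc_tokens):
    """Return (L_y, L_yc): indices of tokens that differ between the
    original response y and its ChatGPT-corrected version y_c."""
    sm = difflib.SequenceMatcher(a=y_tokens, b=yc_tokens)
    L_y, L_yc = [], []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != "equal":               # 'replace', 'delete', or 'insert'
            L_y.extend(range(i1, i2))    # edited token positions in y
            L_yc.extend(range(j1, j2))   # corresponding positions in y_c
    return L_y, L_yc
```

For example, if only one token was corrected, both lists contain just that token's index.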
Segmentation embedding preference ( P_s ) : Random Gaussian noise is added to three rectangular regions of the input image, creating three perturbed images. The LVLM processes each perturbed image with the same instruction, yielding three segmentation embeddings and masks. These masks are ranked by segmentation quality (high, medium, low) to produce preference lists for curriculum learning.
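The noise-injection step can be sketched as follows; the region sizes, noise scale, and the assumption of one rectangle per perturbed copy are illustrative choices, not values from the paper:

```python
import numpy as np

def perturb_regions(image, num_regions=3, sigma=0.1, rng=None):
    """Create perturbed copies of `image` (values assumed in [0, 1]),
    each with Gaussian noise added to one random rectangular region."""
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    perturbed = []
    for _ in range(num_regions):
        # Sample a rectangle covering roughly 10-50% of each dimension.
        rh = rng.integers(h // 10, h // 2)
        rw = rng.integers(w // 10, w // 2)
        top = rng.integers(0, h - rh)
        left = rng.integers(0, w - rw)
        noisy = image.copy()
        region = noisy[top:top + rh, left:left + rw]
        noisy[top:top + rh, left:left + rw] = np.clip(
            region + rng.normal(0.0, sigma, region.shape), 0.0, 1.0)
        perturbed.append(noisy)
    return perturbed
```

Each perturbed copy would then be fed to the LVLM with the unchanged instruction to obtain one candidate segmentation embedding and mask.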
Curriculum Learning and Noise Injection
Fine‑tuning proceeds in two stages. First, high‑ and low‑quality preference samples are used to teach basic object localization. Then, medium‑quality samples focus on boundary refinement. Random noise injection generates diverse embeddings, enabling effective collection of P_s data.
Preference Optimization Loss
The total loss combines a textual DPO loss L_t and a segmentation DPO loss L_s:
Loss = L_t(high, low) + L_s(mask_{ref}, mask_{ft})
where L_t contrasts high-quality (preferred) and low-quality (rejected) textual responses, and L_s contrasts the reference and fine-tuned segmentation masks.
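Both terms follow the general DPO pattern of rewarding the preferred sample over the rejected one relative to a reference model. A minimal numerical sketch of that generic objective (β = 10 is taken from the hyper-parameters below; everything else is textbook DPO rather than the paper's exact loss):

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=10.0):
    """Generic DPO loss: -log sigmoid(beta * reward margin), where the
    implicit reward is the policy/reference log-probability ratio."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically stable -log(sigmoid(margin)) = log(1 + exp(-margin)).
    return np.logaddexp(0.0, -margin)
```

When the fine-tuned model prefers the chosen response more strongly than the reference model does, the margin is positive and the loss approaches zero.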
Preference‑Based Ensemble
For each LVLM output, a preference score is computed. These scores modulate the attention weights γ derived from the query and key projections Q(E) and K(E), so that more reliable outputs receive higher weight in the ensemble.
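A toy illustration of the idea follows; the real Q(E) and K(E) are learned projections and the exact role of γ is not specified here, so the identity projections and the final averaging step are assumptions made for readability:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def preference_ensemble(embeddings, pref_scores):
    """Fuse K candidate embeddings, scaling attention logits toward
    candidates with higher preference scores (toy simplification)."""
    E = np.asarray(embeddings, dtype=float)   # (K, d) candidate embeddings
    s = np.asarray(pref_scores, dtype=float)  # (K,) preference scores
    Q, K = E, E                               # identity projections (assumed)
    logits = (Q @ K.T) / np.sqrt(E.shape[-1])
    logits = logits * s[None, :]              # gamma-style preference scaling
    attn = softmax(logits, axis=-1)           # rows attend over candidates
    fused = attn @ E                          # (K, d) refined candidates
    return fused.mean(axis=0)                 # single ensembled embedding
```

Raising one candidate's preference score shifts attention mass, and hence the fused embedding, toward that candidate.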
Experimental Setup
Datasets: ADE20K, COCO-Stuff, LVIS-PACO, the RefCOCO series, LLaVA-150k, MUSE.
Model: PixelLM backbone with pre‑trained LLaVA‑7B / LLaVA‑Llama2‑13B as LVLM and CLIP‑ViT‑L/14‑336 visual encoder.
Hyper‑parameters: 30 perturbed images per sample, preference‑optimization coefficient β = 10, ensemble uses K = 3 responses.
Metrics: Generalized IoU (gIoU), class‑wise IoU (cIoU), CHAIR (object hallucination), and ChatGPT‑based quality scores.
Results
On the MUSE benchmark, POPEN outperforms LISA, PixelLM, and other baselines across all metrics, notably reducing hallucinations and achieving the highest segmentation accuracy. Similar improvements are observed on RefCOCO, RefCOCO+, and RefCOCOg datasets.
Ablation studies show that removing either the preference‑based optimization or the ensemble degrades both textual quality and segmentation precision, confirming the necessity of both components and of the combined loss.
Conclusion
POPEN demonstrates that aligning LVLMs with human preferences through a dedicated optimization loss and a preference-driven ensemble substantially improves reasoning-guided segmentation, advancing reliable pixel-level understanding in multimodal models.
Tencent Advertising Technology