CPL++: A Self‑Aware, Self‑Correcting Framework for Weakly Supervised Visual Grounding
The CPL++ framework equips weakly supervised visual grounding models with confidence‑aware pseudo‑label learning, self‑supervised association correction, and dynamic verification. Together, these components let the model detect and amend erroneous region‑query associations during training, yielding absolute performance gains of 1–6% across five benchmark datasets.
Background and Motivation
Visual grounding aims to locate image regions from natural‑language queries. Fully supervised methods need dense image‑text‑box annotations, which are costly. Weakly supervised visual grounding uses only image‑text pairs but suffers from unreliable cross‑modal matching and error propagation.
Limitations of Existing Weak Supervision
Prior weakly supervised approaches treat grounding as a retrieval problem, relying on cross‑modal similarity scores or reconstruction losses. The gap between high‑level language concepts and pixel‑level visual features leads to many false pseudo‑associations. Earlier unsupervised methods generate rigid pseudo‑queries lacking diversity and still ignore the impact of erroneous associations.
Proposed Framework: CPL and CPL++
Confidence‑aware Pseudo‑label Learning (CPL) introduces three complementary pseudo‑query generation pipelines—Heuristic+, Object‑Centric, and Relation‑Aware—to produce descriptive, realistic, and diverse pseudo‑queries for each candidate region. Similarity between the real query and the pseudo‑queries is computed entirely in the text feature space; the region whose pseudo‑queries are most similar to the real query becomes the initial pseudo‑label, avoiding direct cross‑modal alignment.
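The text‑space selection step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the use of cosine similarity, and max‑pooling over each region's pseudo‑queries are all assumptions.

```python
import numpy as np

def select_pseudo_label(query_emb, pseudo_query_embs):
    """Pick the region whose pseudo-queries best match the real query.

    query_emb: (d,) text embedding of the real query.
    pseudo_query_embs: list of (k_i, d) arrays, one per candidate region
        (each region may carry several pseudo-queries from the three pipelines).
    Returns the index of the region with the highest text-space similarity.
    Illustrative sketch; the aggregation over pseudo-queries is assumed.
    """
    def cosine(a, b):
        a = a / np.linalg.norm(a)
        b = b / np.linalg.norm(b, axis=-1, keepdims=True)
        return b @ a

    # Score each region by its best-matching pseudo-query, then take the argmax.
    scores = [cosine(query_emb, pq).max() for pq in pseudo_query_embs]
    return int(np.argmax(scores))
```

Because both sides of the comparison are text embeddings, the selection never has to bridge the language–pixel gap directly.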
Static Cross‑Modal Verification
A frozen pre‑trained vision‑language model evaluates each region‑query pair before training and outputs a confidence score. Pairs with scores below a threshold are filtered, reducing the influence of false associations.
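The filtering logic amounts to a simple confidence gate over region–query pairs. The sketch below assumes a generic `score_fn` interface (e.g. a frozen CLIP‑style image–text matching score in [0, 1]); the threshold value is illustrative.

```python
def static_verify(pairs, score_fn, threshold=0.5):
    """Filter region-query pairs with a frozen vision-language model.

    pairs: list of (region, query) tuples.
    score_fn: callable returning a confidence score for a pair, assumed
        to come from a frozen pre-trained VLM (interface is hypothetical).
    Pairs scoring below `threshold` are discarded before training begins.
    """
    return [(r, q) for (r, q) in pairs if score_fn(r, q) >= threshold]
```

Since the verifier is frozen and runs once before training, this step adds no optimization cost during the training loop itself.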
CPL++: Self‑Supervised Association Correction
CPL++ builds a semantic‑aware candidate pool using category, attribute, and spatial relation information extracted from the query. A composite scoring function combines query‑region matching and detector confidence (shown in the figure). During training, if the IoU between the model’s predicted box and the best candidate falls below a threshold, the association is treated as erroneous, re‑weighted, and a refined pseudo‑label is generated.
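The correction step above can be sketched as an IoU check against the top‑scoring candidate. The weighted‑sum form of the composite score (`alpha`) and the threshold values are assumptions for illustration; the paper's exact scoring function may differ.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def correct_association(pred_box, candidates, match_scores, det_scores,
                        iou_thresh=0.5, alpha=0.5):
    """Flag and correct an erroneous region-query association.

    candidates: boxes from the semantic-aware candidate pool.
    match_scores / det_scores: query-region matching and detector confidences.
    The composite score here is a weighted sum (alpha is an assumed weighting).
    If the prediction overlaps the best candidate below `iou_thresh`, the
    association is treated as erroneous and the candidate box becomes the
    refined pseudo-label.
    """
    composite = [alpha * m + (1 - alpha) * d
                 for m, d in zip(match_scores, det_scores)]
    best = max(range(len(candidates)), key=lambda i: composite[i])
    if iou(pred_box, candidates[best]) < iou_thresh:
        return candidates[best], True   # refined pseudo-label, was erroneous
    return pred_box, False
```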
Dynamic Self‑Supervised Verification
CPL++ upgrades the static verifier to a dynamic mechanism. The training loss of each sample is monitored; samples with higher loss receive larger weights via a dynamic selective localization loss, allowing the model to focus on correcting noisy labels while still leveraging the static prior.
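One simple way to realize loss‑dependent sample weighting is a softmax over per‑sample losses, so that higher‑loss samples receive larger weights as described above. This is a hedged sketch; the paper's dynamic selective localization loss may use a different weighting form, and `temperature` is an assumed hyperparameter.

```python
import numpy as np

def dynamic_weights(losses, temperature=1.0):
    """Map per-sample training losses to normalized sample weights.

    Higher-loss samples get larger weights (a softmax realization; the
    exact weighting in the paper may differ). `temperature` controls how
    sharply the weighting concentrates on high-loss samples.
    """
    losses = np.asarray(losses, dtype=np.float64)
    w = np.exp(losses / temperature)
    return w / w.sum()  # weights sum to 1 across the batch
```

A larger `temperature` flattens the weighting toward uniform, which gives a natural knob for how aggressively training focuses on suspect samples.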
Experimental Evaluation
The method is evaluated on five weakly supervised visual grounding benchmarks: RefCOCO, RefCOCO+, RefCOCOg, ReferItGame, and Flickr30K Entities. CPL outperforms existing weakly and unsupervised methods. CPL++ adds absolute improvements of 2.78 %, 5.81 %, 1.08 %, 2.03 %, and 2.55 % on the respective datasets, narrowing the gap to fully supervised approaches.
Qualitative Analysis
Visualizations show that CPL generates diverse, accurate pseudo‑queries, and CPL++’s correction module progressively refines erroneous associations, ultimately aligning predicted boxes with the true target regions.
Paper: https://ieeexplore.ieee.org/document/11433810/
Code: https://github.com/oceanflowlab/CPL