How Triple Alignment and Rationale Generation Supercharge Knowledge‑Based VQA
This paper presents a lightweight, high‑efficiency framework called Triple Alignment with Rationale Generation (TAG) that transforms knowledge‑based visual question answering into a contrastive learning task, dramatically reducing trainable parameters while achieving state‑of‑the‑art performance on major KVQA benchmarks.
Background
Knowledge‑based visual question answering (KVQA) traditionally depends on external knowledge retrieval (which introduces noise) and very large language models (hundreds of billions of parameters), leading to high inference latency and costly deployment.
Method: TAG (Triple Alignment with Rationale Generation)
TAG converts KVQA into a contrastive‑learning problem by jointly aligning three heterogeneous feature spaces while keeping the CLIP backbone frozen:
Image‑Feature Alignment: Align CLIP’s frozen image embeddings with visual features extracted from the input image.
Global Contrast Alignment: Perform contrastive learning over all positive and negative samples in the global CLIP feature space to push apart hard negatives.
Text‑Feature Alignment: Align CLIP’s frozen text embeddings with the question‑option representations encoded by a lightweight language model (e.g., BART).
Only a small trainable language model (0.0152B, i.e., 15.2M parameters) is fine‑tuned; the CLIP backbone stays frozen, preserving its pretrained visual knowledge.
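The three alignments above can be sketched as a sum of InfoNCE‑style contrastive losses over matched batches of features. This is a minimal, self‑contained illustration, not the paper's implementation: the random arrays stand in for frozen CLIP embeddings, lightweight‑LM question‑option features, and visual features, and summing the three terms with equal weight is an assumption.

```python
import numpy as np

def info_nce(anchors, candidates, temperature=0.07):
    """InfoNCE loss: row i of `anchors` should match row i of `candidates`;
    all other rows in the batch act as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    logits = a @ c.T / temperature                 # pairwise similarities
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(a))
    return -log_probs[idx, idx].mean()             # pull diagonal pairs together

rng = np.random.default_rng(0)
B, D = 4, 8                                        # toy batch size and embedding dim
clip_img = rng.normal(size=(B, D))                 # frozen CLIP image embeddings
clip_txt = rng.normal(size=(B, D))                 # frozen CLIP text embeddings
lm_qopt  = rng.normal(size=(B, D))                 # lightweight-LM question+option features
vis_feat = rng.normal(size=(B, D))                 # visual features from the image

# Triple-alignment objective (sketch): only the small LM producing lm_qopt
# (and any projection heads) would receive gradients; CLIP stays frozen.
loss = (info_nce(clip_img, vis_feat)               # image-feature alignment
        + info_nce(clip_img, clip_txt)             # global contrast alignment
        + info_nce(clip_txt, lm_qopt))             # text-feature alignment
print(float(loss))
```

In a real training loop the gradient flows only into the trainable language model and projection layers, which is what keeps the fine‑tuned parameter count at 0.0152B.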
Rationale Generation
During training, an auxiliary decoder generates a natural‑language rationale for each answer. This forces the model to perform deeper logical reasoning and mitigates shortcut learning.
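One common way to wire in such an auxiliary decoder is to add its token‑level cross‑entropy on the gold rationale to the answer loss. The sketch below assumes this formulation; the weighting factor `lam`, the placeholder answer loss, and the toy vocabulary are illustrative and not taken from the paper.

```python
import numpy as np

def token_xent(logits, targets):
    """Mean cross-entropy of gold token ids under the decoder's logits."""
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(1)
vocab, T = 50, 6                                  # toy vocabulary size, rationale length
rationale_ids  = rng.integers(0, vocab, size=T)   # gold rationale token ids
decoder_logits = rng.normal(size=(T, vocab))      # auxiliary decoder outputs

answer_loss    = 0.85                             # placeholder contrastive answer loss
rationale_loss = token_xent(decoder_logits, rationale_ids)
lam = 0.5                                         # assumed auxiliary-loss weight

# Joint objective: answering and explaining are trained together, so the
# model cannot minimize the loss via answer-only shortcuts.
total = answer_loss + lam * rationale_loss
print(float(total))
```

At inference time the rationale decoder can be dropped, so it adds training‑time supervision without inference‑time cost.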
Experimental Evaluation
TAG (total 0.387B parameters) achieves state‑of‑the‑art results on major KVQA benchmarks:
A‑OKVQA validation accuracy: 67.9%; test accuracy: 61.2%.
OK‑VQA accuracy: 52.1%.
VCR accuracy: 70.4%.
These numbers surpass much larger models such as GPT‑3 (175B parameters) while using roughly 1/400 of the parameter count.
Zero‑shot evaluation demonstrates strong generalisation: 45.6% on A‑OKVQA (DA), 52.1% on OK‑VQA, and 70.4% on VCR.
Ablation studies confirm that both the triple‑alignment mechanism and the rationale‑generation module are essential for the observed performance gains.
Efficiency
Because only a small language model is trained and CLIP remains frozen, TAG incurs low computational cost and low latency, making it suitable for large‑scale industrial deployment where real‑time inference is required.
Paper: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11129918