How Triple Alignment and Rationale Generation Supercharge Knowledge‑Based VQA

This paper presents a lightweight, high‑efficiency framework called Triple Alignment with Rationale Generation (TAG) that transforms knowledge‑based visual question answering into a contrastive learning task, dramatically reducing trainable parameters while achieving state‑of‑the‑art performance on major KVQA benchmarks.

AntTech

Background

Knowledge‑based visual question answering (KVQA) traditionally depends on external knowledge retrieval (which introduces noise) and very large language models (hundreds of billions of parameters), leading to high inference latency and costly deployment.

Method: TAG (Triple Alignment with Rationale Generation)

TAG converts KVQA into a contrastive‑learning problem by jointly aligning three heterogeneous feature spaces while keeping the CLIP backbone frozen:

Image‑Feature Alignment: aligns CLIP's frozen image embeddings with the visual features extracted from the input image.

Global Contrast Alignment: performs contrastive learning over all positive and negative samples in the global CLIP feature space, pushing apart hard negatives.

Text‑Feature Alignment: aligns CLIP's frozen text embeddings with the question‑option representations encoded by a lightweight language model (e.g., BART).
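The alignment steps above all reduce to a contrastive objective of the same general shape. As a minimal sketch (not the paper's exact formulation), an InfoNCE-style loss treats the feature pair at the same batch index as the positive and every other batch entry as a negative; the function name and temperature value here are illustrative assumptions:

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.07):
    """InfoNCE-style contrastive loss: each anchor's positive is the sample
    at the same batch index; all other batch entries act as negatives."""
    # L2-normalise both feature sets so dot products are cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature  # (B, B) similarity matrix
    # Cross-entropy against the diagonal (the matching pairs).
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
clip_img = rng.normal(size=(4, 8))                       # stand-in for frozen CLIP features
vis_feat = clip_img + 0.01 * rng.normal(size=(4, 8))     # nearly aligned features
loss_aligned = info_nce(clip_img, vis_feat)
loss_random = info_nce(clip_img, rng.normal(size=(4, 8)))
assert loss_aligned < loss_random  # well-aligned pairs yield a lower loss
```

The same loss can be instantiated once per feature-space pair (image, text, global), which is why the framework can keep the CLIP backbone frozen and only train the projections feeding into it.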

Only a small language model (0.0152B parameters, roughly 15.2M) is fine‑tuned; the CLIP backbone stays frozen, so its pretrained visual representations are preserved.
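To put those counts in perspective, a quick back-of-the-envelope calculation using the figures quoted in this article (and the commonly cited 175B parameter count for GPT‑3) shows how small the trainable portion is:

```python
# Illustrative parameter accounting using the figures quoted in this article.
trainable = 0.0152e9          # fine-tuned small language model
total = 0.387e9               # full TAG model, including the frozen CLIP backbone
frozen_share = (total - trainable) / total
gpt3 = 175e9                  # commonly cited GPT-3 parameter count, for scale
size_ratio = gpt3 / total     # how many times larger GPT-3 is

print(f"frozen share of TAG: {frozen_share:.1%}")    # ~96.1% of parameters never train
print(f"GPT-3 / TAG size ratio: {size_ratio:.0f}x")  # roughly 450x
```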

Rationale Generation

During training, an auxiliary decoder generates a natural‑language rationale for each answer. This forces the model to perform deeper logical reasoning and mitigates shortcut learning.
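One plausible way to wire this in (a sketch under assumptions, not the paper's exact objective) is to add the rationale decoder's token negative log-likelihood to the answer-selection loss; the weighting factor `alpha` and the function names here are hypothetical:

```python
import numpy as np

def rationale_nll(token_log_probs):
    """Average negative log-likelihood of the gold rationale tokens
    under the auxiliary decoder."""
    return -np.mean(token_log_probs)

def total_loss(contrastive_loss, rationale_loss, alpha=1.0):
    """Hypothetical multi-task objective: contrastive answer-selection loss
    plus a weighted rationale-generation term (alpha is an assumption)."""
    return contrastive_loss + alpha * rationale_loss

# Example: log-probabilities the decoder assigns to the gold rationale tokens.
log_probs = np.log([0.5, 0.8, 0.6, 0.9])
loss = total_loss(contrastive_loss=0.42, rationale_loss=rationale_nll(log_probs))
```

Because the rationale term is only used as a training signal, it can be dropped at inference time, keeping deployment latency unchanged.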

Experimental Evaluation

TAG (0.387B parameters in total) achieves state‑of‑the‑art results on major KVQA benchmarks:

A‑OKVQA: 67.9% validation accuracy, 61.2% test accuracy.

OK‑VQA: 52.1% accuracy.

VCR: 70.4% accuracy.

These results surpass those of much larger models such as GPT‑3 (175B parameters) while using roughly 1/450 of the parameter count.

Zero‑shot evaluation demonstrates strong generalisation: 45.6% on A‑OKVQA (DA), 52.1% on OK‑VQA, and 70.4% on VCR.

Ablation studies confirm that both the triple‑alignment mechanism and the rationale‑generation module are essential for the observed performance gains.

Efficiency

Because only a small language model is trained and CLIP remains frozen, TAG incurs low computational cost and low latency, making it suitable for large‑scale industrial deployment where real‑time inference is required.

Paper: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11129918

[Figure: framework overview diagram]

Tags: contrastive learning, multimodal, CLIP, lightweight model, rationale generation, VQA
Written by AntTech

Technology is the core driver of Ant's future creation.