How Decomposed Linguistic Representations Overcome Language Priors in VQA
This article reviews an AAAI 2020 paper that introduces a language‑attention based Visual Question Answering model. The model decomposes each question into type, object, and concept expressions to mitigate language bias. The review explains its modular architecture and summarizes the experiments and ablations, in which the model shows superior performance on VQA‑CP v2.
Research Motivation
Visual Question Answering (VQA) combines computer vision and natural language processing, requiring models to answer questions about images. Existing VQA models often rely on superficial language priors, answering based on question‑answer co‑occurrences rather than visual content, which limits their applicability in scenarios such as online education and visual assistance for the blind.
Proposed Method
The authors propose a language‑attention based VQA framework that learns decomposed linguistic representations of a question: a type expression, an object expression, and a concept expression. These representations are processed by four modules:
Language Attention Module: combines hard and soft attention to separate type, object, and concept phrases.
Question Recognition Module: uses the type expression to identify the question type and generate a Q&A mask indicating plausible answers.
Object Reference Module: employs a top‑down attention mechanism guided by the object expression to focus on relevant image regions.
Visual Verification Module: measures the similarity between the attended visual region and the concept expression (or candidate answers) to infer the final answer.
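To make the data flow between the last three modules concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the dot-product similarity, the sigmoid scoring, and the toy vectors are all assumptions chosen for illustration. It only shows the idea that an object expression guides attention over image regions, and that candidate answers are then scored against the attended region and gated by a Q&A mask.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def attend_regions(object_vec, region_feats):
    """Object reference (sketch): weight regions by similarity to the object expression."""
    weights = softmax([dot(object_vec, r) for r in region_feats])
    dim = len(region_feats[0])
    attended = [sum(w * r[i] for w, r in zip(weights, region_feats)) for i in range(dim)]
    return weights, attended

def verify_answers(attended, answer_vecs, qa_mask):
    """Visual verification (sketch): score each candidate against the attended
    region, then gate the score with the Q&A mask (assumed to lie in [0, 1])."""
    return [m * sigmoid(dot(attended, a)) for a, m in zip(answer_vecs, qa_mask)]

# Toy inputs (hypothetical): two image regions, three candidate answers.
object_vec = [1.0, 0.0]
region_feats = [[4.0, 0.0], [0.0, 4.0]]   # region 0 matches the object expression
answer_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
qa_mask = [1.0, 1.0, 0.0]                 # third answer ruled out by question type

weights, attended = attend_regions(object_vec, region_feats)
scores = verify_answers(attended, answer_vecs, qa_mask)
best = max(range(len(scores)), key=scores.__getitem__)
```

In this toy setup the third candidate has the same embedding as the first, yet it can never win because the mask zeroes it out. That is the role the Q&A mask plays in the description above: it restricts the answer space by question type, while the final choice among plausible answers still depends on the attended visual feature.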
During training, the model is supervised by a combination of losses for language attention, question type classification, Q&A mask generation (using KL divergence), and visual verification.
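As a rough illustration of how such a combined objective might be assembled, the sketch below computes a KL divergence between a target answer distribution and a predicted mask, then sums it with the other loss terms. The per-term weights, the dictionary keys, and the numeric values are hypothetical; the summary above does not state how the individual losses are weighted or normalized.

```python
import math

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two discrete distributions of equal length."""
    return sum(
        t * math.log((t + eps) / (p + eps))
        for t, p in zip(target, predicted)
        if t > 0
    )

def total_loss(losses, weights=None):
    """Weighted sum of named loss terms; weights default to 1.0 (an assumption)."""
    weights = weights or {}
    return sum(weights.get(name, 1.0) * value for name, value in losses.items())

# Hypothetical values for one training step.
target_mask = [0.5, 0.5]      # empirical distribution of plausible answers
predicted_mask = [0.9, 0.1]   # normalized mask produced by the model
mask_loss = kl_divergence(target_mask, predicted_mask)

losses = {
    "language_attention": 0.2,
    "question_type": 0.3,
    "qa_mask": mask_loss,
    "visual_verification": 0.4,
}
loss = total_loss(losses)
```

The KL term is zero only when the predicted mask matches the target distribution, so it pushes the mask toward the answers that are actually plausible for a given question type.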
Experiments
The method is evaluated on the VQA‑CP v2 dataset, which re-splits VQA v2 so that the answer distribution for each question type differs between training and test and thereby exposes models that lean on language priors, as well as on the standard VQA v2 validation set. The reported results show the proposed approach outperforming recent VQA models on VQA‑CP v2, which the authors attribute to decoupling language‑based concept discovery from visual concept verification.
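A toy example helps show why a shifted answer distribution exposes language priors. The sketch below builds a "prior-only" baseline that ignores the image entirely and always returns the most frequent training answer for each question type. The data is made up for illustration and is not drawn from VQA‑CP v2 or from the paper's results.

```python
from collections import Counter, defaultdict

def fit_prior(train):
    """Map each question type to its most frequent training answer."""
    counts = defaultdict(Counter)
    for qtype, answer in train:
        counts[qtype][answer] += 1
    return {q: c.most_common(1)[0][0] for q, c in counts.items()}

def accuracy(prior, data):
    """Fraction of examples where the prior-only guess equals the answer."""
    hits = sum(1 for qtype, answer in data if prior.get(qtype) == answer)
    return hits / len(data)

# Made-up (question type, answer) pairs.
train = [("what color", "white")] * 8 + [("what color", "black")] * 2
test_same = [("what color", "white")] * 8 + [("what color", "black")] * 2
test_shifted = [("what color", "white")] * 2 + [("what color", "black")] * 8

prior = fit_prior(train)
acc_same = accuracy(prior, test_same)        # test distribution matches training
acc_shifted = accuracy(prior, test_shifted)  # test distribution differs from training
```

When the test split follows the training distribution, the blind baseline looks strong; when the distribution shifts, its accuracy collapses. A model that scores well on a shifted split therefore has to be using something other than question-answer co-occurrence.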
Ablation studies confirm the importance of each component: removing the language attention module, the thresholding mechanism, or the Q&A mask each leads to a performance drop, which supports the claim that decomposed linguistic representations are crucial for reducing language priors.
Conclusion
The paper presents a transparent, modular VQA system that learns flexible decomposed question representations, forces the model to rely on visual evidence, and provides interpretable intermediate outputs such as phrase embeddings, attention maps, and Q&A masks. Experiments validate its effectiveness in mitigating language bias.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.