How Decomposed Linguistic Representations Overcome Language Priors in VQA

This article reviews an AAAI 2020 paper that introduces a language‑attention‑based Visual Question Answering (VQA) model. The model decomposes each question into type, object, and concept expressions to mitigate language bias. The review explains the modular architecture and summarizes the experiments and ablations on VQA‑CP v2, where the model outperforms recent VQA models.

Alibaba Cloud Developer

Dec 26, 2019

Research Motivation

Visual Question Answering (VQA) combines computer vision and natural language processing, requiring models to answer questions about images. Existing VQA models often rely on superficial language priors, answering based on question‑answer co‑occurrences rather than visual content, which limits their applicability in scenarios such as online education and visual assistance for the blind.

Proposed Method

The authors propose a language‑attention‑based VQA framework that learns decomposed linguistic representations of a question: a type expression, an object expression, and a concept expression. These representations are processed by four modules:

Language Attention Module: combines hard and soft attention to separate type, object, and concept phrases.

Question Recognition Module: uses the type expression to identify the question type and generate a Q&A mask indicating plausible answers.

Object Reference Module: employs a top‑down attention mechanism guided by the object expression to focus on relevant image regions.

Visual Verification Module: measures the similarity between the attended visual region and the concept expression (or candidate answers) to infer the final answer.
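
How the last three modules could fit together can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (attend, verify), the dot‑product region scoring, the cosine similarity, the multiplication of scores by the Q&A mask, and the toy vectors are all hypothetical choices made for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0

def attend(regions, object_expr):
    """Top-down attention (assumed form): weight each region feature by its
    relevance to the object expression and return the weighted average."""
    weights = softmax([dot(r, object_expr) for r in regions])
    dim = len(regions[0])
    attended = [sum(w * r[i] for w, r in zip(weights, regions)) for i in range(dim)]
    return attended, weights

def verify(attended, answer_embs, qa_mask):
    """Visual verification (assumed form): similarity between the attended
    region and each candidate answer, zeroed where the Q&A mask rules it out."""
    return {a: qa_mask.get(a, 0.0) * cosine(attended, emb)
            for a, emb in answer_embs.items()}

# Toy inputs, invented for illustration only.
regions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # two image-region features
object_expr = [4.0, 0.0, 0.0]                  # embedding of the object phrase
answer_embs = {"red": [1.0, 0.1, 0.0], "green": [0.0, 1.0, 0.0], "two": [1.0, 0.0, 0.0]}
qa_mask = {"red": 1.0, "green": 1.0, "two": 0.0}  # a color question excludes "two"

attended, weights = attend(regions, object_expr)
scores = verify(attended, answer_embs, qa_mask)
prediction = max(scores, key=scores.get)
print(prediction)
```

In this toy setup the answer "two" is the closest match to the attended region by similarity alone, yet the mask removes it because it is implausible for the question type. That separation between deciding which answers are plausible (from language) and checking them against the image (from vision) is the point the module descriptions make.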

During training, the model is supervised by a combination of losses for language attention, question type classification, Q&A mask generation (using KL divergence), and visual verification.
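
A combined objective of this kind can be sketched as a weighted sum of per‑module losses. The sketch below is an assumption‑laden illustration: the summary only states that KL divergence is used for the Q&A mask, so the cross‑entropy terms, the unit weights, the placeholder value for the language attention loss, and all names are hypothetical.

```python
import math

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two discrete distributions over the same answers."""
    return sum(t * math.log((t + eps) / (p + eps))
               for t, p in zip(target, predicted) if t > 0)

def cross_entropy(probs, label, eps=1e-12):
    """Negative log-likelihood of the correct class (assumed loss form)."""
    return -math.log(probs[label] + eps)

def total_loss(losses, weights):
    """Weighted sum of the individual loss terms."""
    return sum(weights[name] * value for name, value in losses.items())

# Toy values, invented for illustration only.
target_mask = [0.5, 0.5, 0.0]      # ground-truth distribution over plausible answers
predicted_mask = [0.4, 0.4, 0.2]   # Q&A mask produced by the model

losses = {
    "language_attention": 0.25,    # placeholder scalar; its form is not described here
    "question_type": cross_entropy([0.7, 0.2, 0.1], label=0),
    "qa_mask": kl_divergence(target_mask, predicted_mask),
    "verification": cross_entropy([0.6, 0.3, 0.1], label=0),
}
weights = {name: 1.0 for name in losses}  # hypothetical equal weighting

loss = total_loss(losses, weights)
print(round(losses["qa_mask"], 4), round(loss, 4))
```

The KL term is zero only when the predicted mask matches the target distribution, so it penalizes probability mass placed on answers that are implausible for the question type.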

Experiments

The method is evaluated on the VQA‑CP v2 dataset, which is constructed so that the answer distribution for each question type differs between the training and test splits, exposing models that rely on language bias. It is also evaluated on the standard VQA v2 validation set. Results show that the proposed approach outperforms recent VQA models, which the authors attribute to decoupling language‑based concept discovery from visual concept verification.
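
Why a shifted answer distribution exposes language priors can be shown with a toy baseline that ignores the image entirely and always returns the most frequent training answer for a question type. The data, names, and resulting numbers below are invented for illustration and are not taken from VQA‑CP v2 or from the paper.

```python
from collections import Counter, defaultdict

def fit_prior(samples):
    """Learn the most frequent answer per question type (no image is used)."""
    counts = defaultdict(Counter)
    for qtype, answer in samples:
        counts[qtype][answer] += 1
    return {qtype: c.most_common(1)[0][0] for qtype, c in counts.items()}

def accuracy(prior, samples):
    """Fraction of samples where the prior-only guess matches the answer."""
    correct = sum(1 for qtype, answer in samples if prior.get(qtype) == answer)
    return correct / len(samples)

# Toy splits: (question type, answer) pairs.
train = [("what color", "white")] * 8 + [("what color", "red")] * 2
test_same = [("what color", "white")] * 8 + [("what color", "red")] * 2
test_shifted = [("what color", "white")] * 2 + [("what color", "red")] * 8

prior = fit_prior(train)
acc_same = accuracy(prior, test_same)        # same distribution as training
acc_shifted = accuracy(prior, test_shifted)  # distribution changed at test time
print(acc_same, acc_shifted)
```

The prior‑only guesser looks competent when the test split mirrors the training split and collapses when the distribution changes, which is the failure mode a changed‑prior benchmark is designed to reveal.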

Ablation studies confirm the importance of each component: removing the language attention module, the thresholding mechanism, or the Q&A mask each leads to a drop in performance, indicating that decomposed linguistic representations are crucial for reducing reliance on language priors.

Conclusion

The paper presents a transparent, modular VQA system that learns flexible decomposed question representations, forces the model to rely on visual evidence, and provides interpretable intermediate outputs such as phrase embeddings, attention maps, and Q&A masks. Experiments validate its effectiveness in mitigating language bias.
