Prompting Large Language Models for Knowledge‑Based Visual Question Answering: The Prophet Framework
This article analyzes the Prophet framework, which leverages a traditional VQA model to generate answer candidates and in‑context examples that prompt GPT‑3, achieving state‑of‑the‑art performance on the challenging OK‑VQA and A‑OKVQA benchmarks.
Introduction
Large language models such as GPT‑3 and ChatGPT store vast amounts of world knowledge and interact with humans directly, prompting researchers to explore whether small, domain‑specific models can tap into that knowledge. The paper "Prompting Large Language Models with Answer Heuristics for Knowledge‑based Visual Question Answering" proposes exactly this idea.
Background
Visual Question Answering (VQA) requires a model to answer a question about an image. Traditional VQA relies solely on training data, but ideal answers often need external knowledge. Early VQA benchmarks supplied a knowledge base (KB) to assist training, yet constructing such KBs is labor‑intensive and limits real‑world applicability. Recent benchmarks (e.g., OK‑VQA) remove the KB and instead draw from open resources like Wikipedia, introducing the problem of retrieving irrelevant or poorly aligned knowledge.
The PICa method addresses this by converting the image to a textual caption, selecting similar Q&A pairs as in‑context examples, and prompting GPT‑3. PICa suffers from two issues: (1) the generated caption may omit the visual details the question actually asks about, steering GPT‑3 far from the correct answer; (2) similarity‑based selection of in‑context examples alone is not enough to elicit high‑quality answers.
Prompt Construction for VQA
For GPT‑3, a prompt follows the triple <task description, in‑context examples, task input>. In VQA, the task description briefly states the task; each in‑context example is a <image caption, question, answer> triple; and the task input is <image caption, question, [blank]>, where GPT‑3 fills in the blank with its answer.
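The triple above can be sketched as a simple string-assembly routine. This is a hypothetical illustration of the PICa-style prompt format; the function name, field labels ("Context:", "Question:", "Answer:"), and the task-description wording are assumptions, not the exact strings used in the paper.

```python
# Hypothetical sketch of the baseline <task description, in-context
# examples, task input> prompt format for a GPT-3 completion call.

def build_vqa_prompt(examples, caption, question):
    """Assemble a PICa-style text prompt.

    examples: list of (caption, question, answer) triples drawn from
    the training set.
    """
    # Task description (wording is an assumption).
    prompt = "Please answer the question according to the context.\n\n"
    # In-context examples: complete <caption, question, answer> triples.
    for ctx, q, a in examples:
        prompt += f"Context: {ctx}\nQuestion: {q}\nAnswer: {a}\n\n"
    # Task input: the answer slot is left blank for GPT-3 to complete.
    prompt += f"Context: {caption}\nQuestion: {question}\nAnswer:"
    return prompt

demo = build_vqa_prompt(
    [("a man riding a wave on a surfboard", "What sport is this?", "surfing")],
    "a plate of pasta with tomato sauce",
    "What country is this dish associated with?",
)
```

The returned string ends at "Answer:", so the model's completion is read directly as the predicted answer.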
Prophet Framework
Prophet introduces two upstream components generated by a conventional VQA model (e.g., MCAN):
Answer candidate set: the top‑K answers from the model’s confidence vector, each paired with its confidence score.
In‑context examples: N training samples whose fused image‑question features have the highest cosine similarity to the target sample; each example is a <image, question, answer> triple.
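The two heuristics above can be sketched with plain array operations. This is a minimal illustration assuming we already have the VQA model's answer-confidence vector, its answer vocabulary, and fused image-question feature vectors for the target and training samples; all array shapes and function names here are assumptions for illustration.

```python
# Sketch of Prophet's two answer heuristics, computed from a frozen
# VQA model's outputs (hypothetical inputs).
import numpy as np

def top_k_candidates(confidences, vocab, k=5):
    """Answer candidates: the top-K answers from the confidence vector,
    each paired with its confidence score."""
    idx = np.argsort(confidences)[::-1][:k]
    return [(vocab[i], float(confidences[i])) for i in idx]

def select_examples(target_feat, train_feats, n=8):
    """In-context examples: indices of the N training samples whose
    fused image-question features are most cosine-similar to the
    target sample's fused feature."""
    t = target_feat / np.linalg.norm(target_feat)
    f = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    sims = f @ t  # cosine similarity against every training sample
    return np.argsort(sims)[::-1][:n]
```

Normalizing both sides first makes the dot product equal to cosine similarity, so one matrix-vector product scores the whole training set at once.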
These components are combined into an enriched prompt for GPT‑3. The prompt format becomes
<task description, in‑context examples (now <caption, question, answer‑candidate set, answer>), task input (caption, question, answer‑candidate set, [blank])>. The task description is also modified to encourage GPT‑3 to select an answer from the provided candidate set.
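Extending the earlier format, the enriched prompt attaches a candidate list (with confidence scores) to every example and to the task input. Again, the exact field labels and task-description phrasing are assumptions; only the structure follows the paper.

```python
# Sketch of the enriched Prophet prompt: each in-context example and
# the task input now carry an answer-candidate set.

def format_candidates(cands):
    """Render candidates, e.g. [("surfing", 0.92)] -> "surfing (0.92)"."""
    return ", ".join(f"{a} ({c:.2f})" for a, c in cands)

def build_prophet_prompt(examples, caption, question, candidates):
    """examples: list of (caption, question, candidate list, answer)."""
    # Modified task description nudging GPT-3 toward the candidate set
    # (wording is an assumption).
    prompt = ("Please answer the question according to the context and the "
              "answer candidates. Each candidate is associated with a "
              "confidence score.\n\n")
    for ctx, q, cands, a in examples:
        prompt += (f"Context: {ctx}\nQuestion: {q}\n"
                   f"Candidates: {format_candidates(cands)}\nAnswer: {a}\n\n")
    prompt += (f"Context: {caption}\nQuestion: {question}\n"
               f"Candidates: {format_candidates(candidates)}\nAnswer:")
    return prompt
```

Because the in-context examples show answers that usually coincide with a high-confidence candidate, GPT-3 learns from the prompt itself to weigh the candidate list rather than ignore it.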
The paper's workflow diagram depicts four stages: generating answer heuristics with the VQA model, selecting similar training samples, assembling the enriched prompt, and querying GPT‑3.
Experimental Setup and Results
Experiments use two difficult VQA datasets: OK‑VQA and A‑OKVQA. The upstream VQA model is an improved MCAN pretrained on traditional VQA data. Baselines include the original MCAN, the improved MCAN without pretraining, and the Prophet‑enhanced MCAN.
Accuracy on the OK‑VQA test set for these baselines is reported in the original paper (the table is not reproduced here).
Prophet is then compared against three groups of methods: (1) methods using an external KB as the knowledge source, (2) large‑scale multimodal pre‑training, and (3) methods built on large language models. The comparative results on OK‑VQA are likewise reported in the original paper.
Prophet markedly outperforms all baselines on the high‑difficulty OK‑VQA dataset, demonstrating the effectiveness of using answer heuristics to guide large language models.
Conclusion
By generating concise answer candidates and carefully selected in‑context examples from a small VQA model, Prophet enables GPT‑3 to produce accurate answers for knowledge‑intensive VQA tasks. The results suggest that appropriately prompting large models can yield performance gains far beyond the capabilities of the small model alone, achieving a synergistic effect where "1 + 1 > 2".
Network Intelligence Research Center (NIRC)
NIRC is based at the State Key Laboratory of Networking and Switching Technology at Beijing University of Posts and Telecommunications. It has built a technology matrix across four AI domains—intelligent cloud networking, natural language processing, computer vision, and machine learning systems—and is dedicated to solving real‑world problems, building top‑tier systems, publishing high‑impact papers, and contributing to the rapid advancement of China's network technology.
