
Evaluating ChatGPT on Complex Knowledge‑Base Question Answering Using a Feature‑Driven Multi‑Label Framework

This study presents a comprehensive evaluation of ChatGPT's ability to answer complex knowledge‑base questions. It introduces a feature‑driven multi‑label classification framework, applies the CheckList black‑box testing methodology across eight KB‑based CQA datasets, and compares ChatGPT's performance with GPT‑3, GPT‑3.5, and FLAN‑T5.

DataFunTalk

Abstract: ChatGPT, a powerful large language model (LLM), has shown impressive natural language understanding, but its performance on complex knowledge‑base question answering (KB‑CQA) requires thorough evaluation. We propose a framework that classifies potential features of complex questions and uses multiple tags to identify compositional reasoning, then assesses ChatGPT using CheckList‑style black‑box tests.

Methodology: We annotate 190,000 test cases from eight KB‑based CQA datasets (six English, two multilingual) with three tag types: answer type, reasoning type, and language. Evaluation follows CheckList's three test categories: Minimum Functionality Tests (MFT), Invariance Tests (INV), and Directional Expectation Tests (DIR). For answer matching, we extract noun‑phrase candidates via constituency parsing, expand them with Wikidata and WordNet aliases, and apply a similarity threshold before manual verification.
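The similarity‑threshold step of the answer matching described above can be sketched as follows. This is a minimal illustration, not the authors' released code: constituency parsing and Wikidata/WordNet alias lookup are assumed to have happened upstream, and the `aliases` list here is a hand‑written stand‑in.

```python
# Sketch of threshold-based answer matching: a candidate answer phrase
# counts as correct if it is close enough to any gold-answer alias.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def matches_gold(candidate: str, gold_aliases: list[str], threshold: float = 0.8) -> bool:
    """Accept the candidate if it clears the threshold against any alias."""
    return any(similarity(candidate, alias) >= threshold for alias in gold_aliases)

# Aliases for one gold answer (e.g. from Wikidata "also known as" labels).
aliases = ["United States of America", "USA", "the United States"]
print(matches_gold("United States", aliases))  # True  (close to an alias)
print(matches_gold("Canada", aliases))         # False
```

In the paper's pipeline, borderline matches below the threshold go on to manual verification rather than being rejected outright.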

Datasets and Models: The datasets cover diverse reasoning operations (single‑hop, multi‑hop, set operations, counting, comparison, star‑shape) and answer types (DATE, LOC, PER, WHY, Boolean, MISC, NUM, UNA). We compare ChatGPT with GPT‑3, GPT‑3.5, and FLAN‑T5, and include current SOTA fine‑tuned and zero‑shot baselines.
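One way to picture the multi‑label annotation scheme is as a record that carries one answer‑type tag, a set of reasoning‑type tags (several may apply to one question), and a language tag. The field names below are illustrative; only the tag values are taken from the lists above.

```python
# Illustrative representation of one annotated test case with the
# three tag types (answer type, reasoning type, language).
from dataclasses import dataclass, field

ANSWER_TYPES = {"DATE", "LOC", "PER", "WHY", "Boolean", "MISC", "NUM", "UNA"}
REASONING_TYPES = {"single-hop", "multi-hop", "set operation",
                   "counting", "comparison", "star-shape"}

@dataclass
class TestCase:
    question: str
    answer_type: str                              # exactly one answer-type tag
    reasoning: set = field(default_factory=set)   # multi-label: several may apply
    language: str = "en"

    def __post_init__(self):
        assert self.answer_type in ANSWER_TYPES
        assert self.reasoning <= REASONING_TYPES

case = TestCase(
    question="How many countries border both France and Spain?",
    answer_type="NUM",
    reasoning={"multi-hop", "set operation", "counting"},
)
print(case.answer_type, sorted(case.reasoning))
```

The compositional questions the framework targets are exactly those whose reasoning set contains more than one tag.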

Results: ChatGPT outperforms the other LLMs on most datasets and surpasses SOTA on WQSP and GraphQuestions, but lags behind on entity‑rich datasets such as KQApro, LC‑quad2.0, and GrailQA. It excels at set‑operation and comparative reasoning, while struggling with numeric, causal, and temporal answers. In INV tests, ChatGPT shows roughly 79% reliability, though stability varies across reasoning types. DIR experiments reveal that answer‑type prompts help for Boolean and NUM questions yet often fail for other types; CoT prompts improve performance, especially on COUNT and NUM tasks.
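The INV reliability figure above comes from CheckList‑style invariance testing: apply a label‑preserving perturbation to each question and check whether the model's answer changes. The sketch below uses a stub predictor in place of an LLM call and a character‑swap typo as the perturbation; both are stand‑ins for illustration.

```python
# CheckList-style Invariance (INV) test sketch: a stable model should
# give the same answer before and after a meaning-preserving typo.
def add_typo(question: str) -> str:
    """Swap two adjacent characters mid-string; meaning is preserved."""
    i = len(question) // 2
    return question[:i] + question[i + 1] + question[i] + question[i + 2:]

def model(question: str) -> str:
    # Stub predictor standing in for a real LLM call.
    return "Paris" if "France" in question else "unknown"

def invariance_rate(questions, predict, perturb) -> float:
    """Fraction of questions whose prediction survives the perturbation."""
    stable = sum(predict(q) == predict(perturb(q)) for q in questions)
    return stable / len(questions)

qs = ["What is the capital of France?", "Which river is the longest?"]
print(invariance_rate(qs, model, add_typo))
```

The paper's ~79% figure is this kind of stability rate measured over the annotated test cases, broken down by reasoning type.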

Conclusion: The analysis highlights both strengths and limitations of ChatGPT in complex KB‑CQA, providing insights for future LLM development and downstream research.

Tags: large language models, ChatGPT, knowledge base, evaluation, CheckList, Complex QA, Multi-label
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
