Complex Question Answering Evaluation of ChatGPT
This paper presents a large-scale evaluation of ChatGPT on complex question answering over knowledge bases. It introduces a feature-driven multi-label annotation framework together with CheckList-based tests of functionality, robustness, and controllability, and compares ChatGPT's performance with other LLMs across multiple English and multilingual datasets.
ChatGPT, a powerful large language model, has achieved notable progress in natural language understanding, but its performance and limitations on complex knowledge‑base question answering (KB‑CQA) require thorough assessment. To evaluate ChatGPT as a potential replacement for traditional KBQA systems, the authors propose a framework that classifies potential features of complex questions and uses multiple tags to describe each test item, enabling the identification of combinatorial reasoning requirements.
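To make the annotation scheme concrete, the sketch below shows one way such multi-label tags could be represented in Python. The `AnnotatedQuestion` structure, the tag names, and the example question are illustrative assumptions, not the paper's actual label set.

```python
# Illustrative sketch of feature-driven multi-label annotation.
# The tag vocabulary here is a guess at the spirit of the scheme,
# which covers answer type, reasoning type, and language.

from dataclasses import dataclass, field


@dataclass
class AnnotatedQuestion:
    """A test item described by multiple feature tags."""
    text: str
    answer_type: str  # e.g. "entity", "number", "boolean" (hypothetical values)
    reasoning_types: list[str] = field(default_factory=list)
    language: str = "en"


# A single question can require several reasoning skills at once,
# which is what surfaces combinatorial reasoning requirements.
example = AnnotatedQuestion(
    text="Which of the rivers crossing Paris and Vienna is longer?",
    answer_type="entity",
    reasoning_types=["multi-hop", "comparison"],  # illustrative tag names
)

# Grouping test items by their tag combination lets per-combination
# accuracy be reported rather than a single aggregate score.
key = (example.answer_type, tuple(sorted(example.reasoning_types)))
print(key)  # ('entity', ('comparison', 'multi-hop'))
```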
The study highlights the growing interest in ChatGPT’s capabilities for KBQA, given its extensive coverage of resources like Wikipedia and strong natural language comprehension. Complex QA tasks demand multi‑hop reasoning, attribute comparison, set operations, and other sophisticated inference, making them a challenging benchmark for any QA model.
Building on prior work in prompt engineering, LLM evaluation, and black-box testing (e.g., CheckList), the authors adopt a two-stage evaluation framework. The first stage employs feature-driven multi-label annotation to capture the answer type, reasoning type, and language attributes of each question. The second stage applies CheckList-style tests (minimum functionality tests, MFT; invariance tests, INV; and directional expectation tests, DIR) to measure functional correctness, robustness, and controllability, with Chain-of-Thought (CoT) prompting used to generate the INV and DIR cases.
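As a rough illustration of the two CheckList-style robustness checks, the sketch below implements a minimal INV and DIR test pair. Here `ask_model` is a placeholder for whatever LLM API is under evaluation, and the hand-written paraphrase and prompt suffix stand in for the paper's CoT-generated cases.

```python
# Minimal sketch of CheckList-style INV and DIR tests.
# `ask_model` is a placeholder, not the paper's actual harness.

def ask_model(question: str) -> str:
    raise NotImplementedError("connect to the LLM being evaluated")


def invariance_test(original: str, paraphrase: str) -> bool:
    """INV: a meaning-preserving rephrase should not change the answer."""
    a = ask_model(original).strip().lower()
    b = ask_model(paraphrase).strip().lower()
    return a == b


def directional_test(question: str, expected_answer: str) -> bool:
    """DIR: a directed perturbation (here, a CoT-style instruction)
    is expected to steer the model toward the reference answer."""
    prompted = question + " Let's think step by step."
    return expected_answer.lower() in ask_model(prompted).lower()
```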
Eight KB-CQA datasets (six English, two multilingual), comprising roughly 190,000 questions, are used for evaluation. ChatGPT is compared against GPT-3, GPT-3.5, FLAN-T5, and current state-of-the-art (SOTA) systems, and the authors release the annotated datasets and code publicly.
Experimental results show that ChatGPT outperforms the other LLMs on most datasets, with superior accuracy on set-operation and comparative reasoning tasks and strong performance on low-resource languages. However, it lags behind the SOTA systems on numeric, temporal, and multi-hop or star-shaped reasoning, and its stability varies across test types.
Detailed analysis of MFT, INV, and DIR reveals an overall reliability of about 79% for ChatGPT on complex QA, with CoT prompts notably improving performance on counting (NUM) questions. The findings underscore both the strengths and limitations of ChatGPT in complex reasoning scenarios.
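As a concrete example of the kind of prompt involved, a CoT-style prompt for a counting question might look like the following; the exact wording is an illustrative assumption, not the authors' template.

```python
# Hypothetical CoT prompt for a counting (NUM) question; the paper
# reports CoT prompting helps on counting, but this wording is a guess.
question = "How many official languages does Switzerland have?"
cot_prompt = (
    f"{question}\n"
    "Let's think step by step: list each item first, "
    "then state the final count as a single number."
)
print(cot_prompt)
```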
In conclusion, the paper provides a comprehensive benchmark for assessing ChatGPT’s ability to answer complex questions using its internal knowledge, offering insights that can guide future development of large language models and downstream QA research.