Evaluating the Commonsense Knowledge and Reasoning Capabilities of ChatGPT and Other Large Language Models

This study systematically evaluates ChatGPT and other large language models on three abilities: answering commonsense questions, identifying the knowledge a question requires, and using generated knowledge for reasoning. It finds strong QA performance but notable gaps in social and temporal commonsense and in leveraging contextual knowledge.

Architect

Recent advances in large language models (LLMs) such as GPT‑3, ChatGPT, and GPT‑4 have achieved impressive NLP performance, yet their capacity to remember, represent, and apply commonsense knowledge remains uncertain.

The paper investigates four key questions: (1) Can GPTs effectively answer commonsense questions? (2) Do they possess extensive commonsense knowledge? (3) Are they aware of the specific commonsense required for a given question? (4) Can they efficiently use commonsense to answer questions?

To answer these questions, the authors run experiments on eleven commonsense QA datasets covering eight domains (general, physical, social, scientific, event, numerical, prototype, and temporal). Three models are evaluated: GPT‑3 (davinci) with 4‑shot prompting, and GPT‑3.5 and ChatGPT with zero‑shot prompting.
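The difference between 4‑shot and zero‑shot prompting can be sketched as follows. This is a minimal illustration, not the authors' template: the prompt layout, the function names, and the sample questions are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple

# Hypothetical types and layout; the paper's actual prompt template is not
# given in this summary.
Demo = Tuple[str, List[str], str]   # (question, choices, gold answer letter)
Target = Tuple[str, List[str]]      # (question, choices)

LETTERS = "ABCDEFGH"


def format_question(question: str, choices: List[str]) -> str:
    """Render one multiple-choice question with lettered options."""
    lines = [f"Question: {question}"]
    lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(choices)]
    lines.append("Answer:")
    return "\n".join(lines)


def build_prompt(target: Target, demos: Sequence[Demo] = ()) -> str:
    """Zero-shot when demos is empty; k-shot when k solved demos are given."""
    blocks = [f"{format_question(q, c)} {answer}" for q, c, answer in demos]
    blocks.append(format_question(*target))
    return "\n\n".join(blocks)


# Made-up sample items, for illustration only.
demos = [("Where do fish live?", ["In water", "In trees"], "A")] * 4
target = ("What do people use to cut paper?", ["Scissors", "A spoon"])

zero_shot = build_prompt(target)         # ChatGPT / GPT-3.5 style
four_shot = build_prompt(target, demos)  # GPT-3 (davinci) style
```

In both cases the prompt ends with an open "Answer:" slot for the model to complete; the 4‑shot variant simply prepends four solved examples.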

Results show that GPT‑3.5 and ChatGPT achieve high QA accuracy on most datasets, with especially strong performance on ARC and ProtoQA (≈94%). However, they struggle with social, event, and temporal commonsense, often scoring below 70%.

Further analysis examines whether the models can identify the knowledge needed to answer a question. Manual evaluation of the generated knowledge shows that ChatGPT frequently produces noisy or over‑generalized knowledge, with an average precision of about 56% and recall of about 84%. The high recall means most of the required knowledge appears in its output; the low precision means roughly 44% of what it generates is not actually needed, so it cannot pinpoint the essential pieces.
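To make the two metrics concrete, here is a minimal sketch of how precision and recall could be computed for one question. It assumes the human judgments have already been reduced to two sets of knowledge labels; the paper's evaluation was manual, and the function name and labels below are hypothetical.

```python
from typing import Set, Tuple


def precision_recall(generated: Set[str], needed: Set[str]) -> Tuple[float, float]:
    """Precision: share of generated knowledge that is actually needed.
    Recall: share of needed knowledge that was generated."""
    overlap = generated & needed
    precision = len(overlap) / len(generated) if generated else 0.0
    recall = len(overlap) / len(needed) if needed else 0.0
    return precision, recall


# Made-up labels: the model lists four facts, two of which are needed,
# and misses one needed fact.
generated = {"k1", "k2", "k3", "k4"}
needed = {"k1", "k2", "k5"}
precision, recall = precision_recall(generated, needed)
```

With these illustrative sets, precision is 2/4 and recall is 2/3. A model that lists many loosely related facts pushes recall up and precision down, which is the pattern reported for ChatGPT.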

When the generated knowledge is added back into the prompt, ChatGPT does not consistently improve its answers: gains are limited, and accuracy sometimes drops, suggesting the model cannot effectively leverage its own generated commonsense.
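The comparison described here can be sketched as a before/after accuracy check. This is an assumed setup rather than the authors' code: the "Knowledge:" header, the function names, and the toy predictions are all hypothetical.

```python
from typing import List, Sequence


def add_knowledge(prompt: str, knowledge: List[str]) -> str:
    """Prepend generated knowledge statements to a QA prompt (assumed layout)."""
    if not knowledge:
        return prompt
    header = "Knowledge:\n" + "\n".join(f"- {fact}" for fact in knowledge)
    return f"{header}\n\n{prompt}"


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that match the gold answers."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def accuracy_delta(base: Sequence[str], augmented: Sequence[str],
                   gold: Sequence[str]) -> float:
    """Positive when added knowledge helps, negative when it hurts."""
    return accuracy(augmented, gold) - accuracy(base, gold)


# Toy predictions, not results from the paper.
gold = ["A", "B", "C", "D"]
base_preds = ["A", "B", "C", "A"]        # answers without added knowledge
augmented_preds = ["A", "B", "A", "A"]   # answers with added knowledge
delta = accuracy_delta(base_preds, augmented_preds, gold)
```

A negative delta on a dataset corresponds to the case where adding the model's own generated knowledge makes its answers worse.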

The authors conclude that while ChatGPT is knowledge‑rich, it behaves like an inexperienced problem‑solver: it is not reliably aware of which commonsense a question requires, and it does not use knowledge supplied in context effectively when reasoning. Future work should focus on better knowledge‑aware mechanisms, targeted injection of missing social and temporal commonsense, and more comprehensive evaluation benchmarks.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
