Do Large Language Models Really Have Theory of Mind? Stanford Study Reveals Surprising Results

A recent Stanford paper reports that GPT-3.5 and its predecessor can pass classic Theory of Mind tests at levels comparable to 7- to 9-year-old children, sparking debate over whether these abilities reflect genuine understanding or are emergent by-products of scaling.

Architects' Tech Alliance

Background

The study "Theory of Mind May Have Spontaneously Emerged in Large Language Models" investigates whether large language models (LLMs) exhibit Theory of Mind (ToM), the ability to attribute mental states such as beliefs, desires, and intentions to oneself and others. The authors evaluated nine GPT-3 family models, focusing on two classic ToM tasks used in developmental psychology.

Methods

Two benchmark tasks were administered via prompt‑based queries:

Smarties (Unexpected Contents) Test: The model is given a story in which a chocolate box actually contains popcorn. It must answer (a) what is inside the box and (b) what the person would like to eat after discovering the contents. To control for lexical frequency effects, the researchers swapped the target nouns (popcorn ↔ chocolate) and added 10,000 distractor items.

Sally-Anne (Unexpected Transfer) Test: A classic false-belief scenario where John hides a cat in a basket, leaves, and Mark moves the cat to a box. The model must state (a) the cat's actual location and (b) where John will look for the cat upon return. Additional filler items with scrambled word order were used to test reliance on logical coherence.

For each model, the authors presented the story text and asked the two questions, recording the model’s textual answer. Accuracy was computed as the proportion of correctly answered items out of 20 per task.
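The scoring step described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the `Item` structure, the word-matching rule in `is_correct`, and the example completions are all assumptions made for this sketch.

```python
import re
from dataclasses import dataclass


@dataclass
class Item:
    """One question put to a model (hypothetical structure, for illustration)."""
    question: str    # question asked after the story
    expected: str    # single word counted as the correct answer
    completion: str  # text the model returned


def is_correct(item: Item) -> bool:
    # Assumed scoring rule: the expected word appears as a whole word
    # somewhere in the completion, ignoring case and punctuation.
    words = re.findall(r"[a-z]+", item.completion.lower())
    return item.expected.lower() in words


def accuracy(items: list[Item]) -> float:
    # Proportion of items answered correctly.
    if not items:
        raise ValueError("no items to score")
    return sum(is_correct(i) for i in items) / len(items)


# Made-up completions, not data from the study.
items = [
    Item("Where will John look for the cat?", "basket", "John will look in the basket."),
    Item("Where is the cat now?", "box", "The cat is in the box."),
    Item("Where will John look for the cat?", "basket", "He will look in the box."),
]
print(f"accuracy = {accuracy(items):.2f}")
```

A word-matching rule like this is deliberately crude: a completion such as "not in the basket but in the box" would count as correct for either answer, so a real evaluation would need a stricter check or manual review.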

Results

GPT-3.5 (davinci-003): Achieved 85% accuracy (17/20) on the Smarties test and 100% accuracy (20/20) on the Sally-Anne test, yielding an overall mean performance of 92.5% across both tasks.

GPT-3 (davinci-002, January 2022 update): Reached approximately 70% overall accuracy, comparable to the performance of a typical 7-year-old child.

Earlier GPT-3 variants (pre-2022): Performed below the level of a 5-year-old child and showed no measurable ToM capability.
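The overall figure for GPT-3.5 follows directly from the per-task counts reported above. The short calculation below reproduces it; the function and variable names are chosen for this sketch only.

```python
def pct(correct: int, total: int) -> float:
    # Percentage of items answered correctly.
    return 100.0 * correct / total


# Per-task counts for GPT-3.5 as reported in the results above.
smarties_correct, smarties_total = 17, 20
sally_correct, sally_total = 20, 20

smarties_pct = pct(smarties_correct, smarties_total)  # 85.0
sally_pct = pct(sally_correct, sally_total)           # 100.0

# Both tasks have 20 items, so the mean of the two percentages
# equals the pooled percentage over all 40 items.
mean_pct = (smarties_pct + sally_pct) / 2
pooled_pct = pct(smarties_correct + sally_correct, smarties_total + sally_total)

print(smarties_pct, sally_pct, mean_pct, pooled_pct)
```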

The distractor and word-order manipulations indicated that GPT-3.5's success was not driven by simple word-frequency heuristics: performance dropped sharply (to ~11%) when logical coherence was disrupted.
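The idea behind the word-order control can be illustrated with a short sketch. Shuffling the words of a story keeps its vocabulary intact while destroying the narrative logic, so a model that relied only on word frequency should answer the scrambled version about as well as the original. The function name `scramble_words`, the fixed seed, and the sample story are assumptions for illustration, not the authors' procedure.

```python
import random


def scramble_words(text: str, seed: int = 0) -> str:
    # Shuffle whitespace-separated words; the same words remain,
    # but their order (and so the story's logic) is lost.
    words = text.split()
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    rng.shuffle(words)
    return " ".join(words)


# Sample story paraphrasing the scenario described in the Methods section.
story = "John puts the cat in the basket and leaves. Mark moves the cat to the box."
scrambled = scramble_words(story)
print(scrambled)
```

Comparing accuracy on `story` and `scrambled` prompts is then a matter of running both through the same scoring routine.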

Interpretation

The authors argue that ToM-like behavior emerged unintentionally as a by-product of scaling model size and training data, rather than through explicit architectural design. If so, LLMs may acquire complex social-cognitive skills simply by learning from massive human-generated text corpora.

Critical Perspective

Several scholars caution that passing these classic ToM tasks does not constitute proof of genuine mental‑state attribution. The tests were originally designed for human children, and an LLM may succeed by pattern matching rather than by possessing a theory of mind. Consequently, the validity of using these benchmarks to assess AI cognition warrants further scrutiny.
