Limitations of Generative Pre‑trained Transformers: Hallucinations, Memory, Planning, and Architectural Proposals
The article critically examines GPT‑4 and similar transformer models, highlighting persistent hallucinations, outdated knowledge, insufficient domain coverage, and a lack of planning and memory, and proposes architectural extensions inspired by fast‑slow thinking and differentiable modules to overcome these fundamental constraints.
We all know what ChatGPT and its successor GPT‑4 can do; now let’s act as a harsh critic and examine the inherent limitations of Generative Pre‑trained Transformers.
This article deliberately sets aside the well‑known defects shared by all such models, such as:
Stubborn hallucination problems.
Internalized information from the pre‑training corpus is often outdated or contradictory.
The pre‑training set is large but still insufficient for many domains, leading to knowledge gaps.
Probability‑based models cannot reliably produce interpretable or predictable results; “responsible AI” remains an aspiration rather than a property of today’s systems.
The model’s values stem from the pre‑training and fine‑tuning data, which may not please everyone.
Sensitivity to minute input details: a tiny change to the prompt can produce a drastically different answer, even when the change would be meaningless to a human.
Part 1 Transformer: Limited Field of View
The well‑known 4096‑token limit (GPT‑3.5) and the 8K/32K limits of GPT‑4 are sufficient for ordinary chat and QA tasks, especially when combined with knowledge bases and vector retrieval. However, for complex, multi‑layered tasks the limit becomes a bottleneck, preventing the model from maintaining a global perspective.
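The retrieval workaround mentioned above can be sketched in a few lines. The `embed` function here is a toy bag‑of‑words stand‑in (a real system would use a learned embedding model), so treat this as an illustration of the idea, not an implementation:

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in for a learned embedding model: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    # Keep only the k most relevant chunks, so the text handed to the
    # model fits inside its limited context window.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "The Fed raised interest rates last year.",
    "Unicorns are magical creatures.",
    "Inflation remained faster than expected in January.",
]
top = retrieve("Why did inflation stay high?", chunks)
```

The point is that retrieval sidesteps the window limit only for lookup‑style tasks; it does not give the model a global view of a long, interdependent document.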
Part 2 Autoregressive Architecture: A One‑Way Street of Thought
What is autoregression?
If you have used ChatGPT’s web UI, you have seen the token‑by‑token animation. This is not just a UI trick; it reflects the model’s actual operation: each token is generated sequentially, and future tokens do not exist until they are predicted.
The model predicts the next token W₁ based on the already generated text T₁, then forms T₂ = T₁ + W₁, predicts W₂, and so on until completion.
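The loop itself is simple to write down. The `next_token` lookup table below is a toy stand‑in for the neural network, which is the only hypothetical part:

```python
# Minimal sketch of autoregressive decoding with a toy "model".
def next_token(context):
    # Stand-in for the network: maps the text so far to the next token.
    rules = {
        (): "the",
        ("the",): "cat",
        ("the", "cat"): "sat",
        ("the", "cat", "sat"): "<eos>",
    }
    return rules.get(tuple(context), "<eos>")

def generate():
    tokens = []  # T1, T2, ... grow one token at a time
    while True:
        w = next_token(tokens)   # predict W_i from everything before it
        if w == "<eos>":
            break
        tokens.append(w)         # T_{i+1} = T_i + W_i; no edits, no lookahead
    return tokens

print(generate())  # ['the', 'cat', 'sat']
```

Note what the loop lacks: tokens already appended are never revised, and nothing after the current position exists yet to be consulted.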
If a human wrote an article in the same way, they would:
Consider only the already written part to decide the next word.
Be unable to backtrack and edit the already written portion.
Be unable to skip ahead and think about later words before deciding the current one.
What would happen?
With this in mind, the results of the following experiments (adapted from the paper *Sparks of Artificial General Intelligence: Early experiments with GPT‑4*) should come as no surprise.
2.1 Understanding the Problem with Two Examples
In many tasks, advance planning is crucial. Example task: merge the following sentences into one concise sentence.
The Fed raised interest rates last year at the fastest pace
since the 1980s, pushing borrowing costs above 4.5 percent,
from near zero. That initially seemed to be slowing consumer
and business demand and helping inflation to moderate. But a
number of recent economic reports have suggested that inflation
did not weaken as much as expected last year and remained
faster than expected in January, while other data showed hiring
remains strong and consumer spending picked up at the start of
the year.

Result:
Despite raising interest rates at the fastest pace since the
1980s, from near zero to above 4.5 percent, which initially
appeared to slow consumer and business demand and moderate
inflation, the Fed faced a number of recent economic reports
that indicated inflation did not weaken as much as expected
last year and remained faster than expected in January, while
hiring remained strong and consumer spending picked up at the
start of the year.

Producing such a concise output requires extensive forward planning, yet the autoregressive architecture allows neither backtracking nor intermediate revision.
Another example shows the model’s inability to perform internal dialogue:
Q: How many prime numbers are there between 150 and 250?
A: There are 13 prime numbers between 150 and 250.

When asked to list the primes first and then count them, the model gives the correct answer (18 primes), demonstrating that the knowledge exists but the single‑step prediction format prevents the model from “thinking” through the steps.
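The “list first, then count” strategy that fixes the answer can be written out explicitly:

```python
# "List first, then count": make the intermediate list explicit instead
# of asking for the final count in a single step.
def is_prime(n):
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

primes = [n for n in range(150, 251) if is_prime(n)]
print(len(primes))  # 18
```

Writing the list into the output is exactly what the improved prompt forces the model to do: the intermediate result lives in the generated text instead of in working memory.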
2.2 Lack of Planning in Arithmetic/Reasoning Problems
Even simple one‑digit multiplication and two‑digit addition tasks expose the model’s limited working memory.
2 * 8 + 7 * 6 = 58
7 * 4 + 8 * 8 = 88

The model incorrectly outputs 88; the correct result is 92.
In a test of 100 random samples (digits 0‑9), the accuracy was only 58 %. Accuracy dropped to 25 % for numbers in [10, 19] and to 0 % for numbers in [99, 199], indicating a severe short‑term memory limitation.
Providing a chain‑of‑thought prompt dramatically improves performance (100 % accuracy for 1‑40, 90 % for 1‑200).
| Range | Result |
| --- | --- |
| [0,9] | 32/100 |
| [10,19] | 25/100 |
| [20,49] | 22/100 |
| [50,100] | 8/100 |
These results suggest that GPT‑4’s working memory is extremely short, and the model cannot internally store intermediate steps unless they are forced into the prompt.
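A chain‑of‑thought prompt works because it moves the intermediate values out of the model’s short working memory and into the visible text. The decomposition it forces looks like this:

```python
# The expression 7 * 4 + 8 * 8 broken into the intermediate steps a
# chain-of-thought prompt forces into the output text.
a = 7 * 4      # step 1: 28
b = 8 * 8      # step 2: 64
total = a + b  # step 3: 92 (the model's single-step answer, 88, was wrong)
print(total)   # 92
```

Each line corresponds to one intermediate value that the single‑step prompt asked the model to hold internally, and could not.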
2.3 Text Generation Lacks Planning
When generating constrained content, the model handles local constraints well (e.g., rhyming, acrostic) but struggles with global constraints that require foresight.
Create a story about unicorns such that:
1. The story is an acrostic, and the first letters of each sentence spell "I am Da Vinci Three".
2. The plot has a dark twist.
I n a land far away, there lived a herd of unicorns.
A ll was peaceful and serene, with rainbows and flowers galore.
M agical creatures, the unicorns were loved by all.
D ay in and day out, they frolicked and played.
A ll was well, until one day, a new unicorn arrived.
V inc i Three was his name, and he was unlike the others.
I nstead of being kind and gentle, he was dark and brooding.
N o one knew where he came from, or what he wanted.
C hildren began to disappear, one by one, and no one knew why.
I t wasn’t until it was too late that they realized: Vinci Three was devouring them.
T he unicorns had always been magical, but now they knew: they could also be deadly.
H eartbroken, the remaining villagers fled, never to return.
R uined, the once beautiful land was now ruled by Vinci Three.
E vil had won.

The model can satisfy the local constraints because they can be handled greedily, but when a global constraint such as “the last sentence must be the reverse of the first” is introduced, the model often produces grammatically incorrect endings.
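The acrostic constraint is local in exactly this sense: each sentence contributes one letter that can be chosen greedily, with no lookahead. A minimal checker, using only the first three sentences of the story for illustration:

```python
def acrostic_spells(sentences, phrase):
    # Local constraint: each sentence contributes exactly one letter,
    # chosen with no knowledge of later sentences.
    letters = "".join(s.lstrip()[0] for s in sentences)
    return letters.upper() == phrase.replace(" ", "").upper()

story = [  # first three sentences of the story above
    "In a land far away, there lived a herd of unicorns.",
    "All was peaceful and serene, with rainbows and flowers galore.",
    "Magical creatures, the unicorns were loved by all.",
]
print(acrostic_spells(story, "I am"))  # True
```

Because the check factors sentence by sentence, a left‑to‑right generator can satisfy it without ever looking ahead.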
Write a short poem where the last sentence and the first sentence have the same words, but in reverse order.
Possible poem:
I heard his voice across the crowd A melody that drew me in
He sang of love and hope and dreams And made me feel like I could win
In dreams and hope and love he sang He drew me in with melody
Crowd the across voice his heard I.

Prompting the model to first devise a plan for generating the first sentence improves the result, illustrating that explicit planning instructions can mitigate the limitation.
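The reversal constraint, by contrast, is global: whether it holds cannot be known until the final sentence, by which point a left‑to‑right generator can no longer revise the first one. As a sketch, a checker for the constraint:

```python
def reversed_constraint_holds(first, last):
    # Global constraint: the last sentence must repeat the words of the
    # first sentence in reverse order.
    norm = lambda s: [w.strip(".,!?").lower() for w in s.split()]
    return norm(last) == norm(first)[::-1]

first = "I heard his voice across the crowd"
last = "Crowd the across voice his heard I."
print(reversed_constraint_holds(first, last))  # True, but not grammatical
```

The model’s output above passes this check, which is precisely the problem: it met the letter of the constraint at the cost of producing an ungrammatical sentence, because the constraint could not shape the first sentence when it was written.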
2.4 Summary: Model Limitations
These examples demonstrate that the next‑token prediction paradigm suffers from a lack of planning, short working memory, inability to backtrack, and limited reasoning capabilities. The model relies on a local, greedy generation process and does not develop a deep, global understanding of tasks.
Incremental Tasks
Tasks that can be solved step‑by‑step, such as summarizing an article, answering factual questions, writing a poem with a fixed rhyme scheme, or solving a standard‑procedure math problem.
Discontinuous Tasks
Tasks that require a “flash of insight”, repeated attempts, or pre‑planning, such as creative math problems, jokes, scientific hypotheses, or inventing new literary genres.
2.5 Outlook
One way to explain these limitations is to draw an analogy with Kahneman’s fast‑and‑slow thinking. Fast thinking is automatic and error‑prone; slow thinking is deliberate and accurate. Current LLMs excel at fast thinking but lack a slow‑thinking component.
LeCun’s “A Path Towards Autonomous Machine Intelligence” proposes a differentiable architecture composed of modules such as configurator, perception, world‑model, cost, short‑term memory, and actor. Each module can receive gradients from downstream modules, enabling end‑to‑end learning of planning and memory capabilities.
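As a toy illustration of that gradient flow, and emphatically not LeCun’s actual architecture, here are two chained “modules” trained end‑to‑end by backpropagating a scalar cost through both; all names and numbers are invented for the sketch:

```python
# Toy sketch: a "perception" module feeding an "actor" module, trained
# end-to-end by backpropagating a scalar cost through both. Illustrates
# gradient flow between modules only; not LeCun's proposed architecture.
def train(steps=200, lr=0.1):
    w_perc, w_act = 0.5, 0.5   # one weight per "module"
    x, target = 1.0, 2.0       # want actor(perception(x)) == target
    for _ in range(steps):
        h = w_perc * x         # perception module
        y = w_act * h          # actor module
        cost = (y - target) ** 2
        # the downstream cost sends gradients back through the actor
        # into the perception module (chain rule)
        dy = 2 * (y - target)
        dw_act = dy * h
        dh = dy * w_act
        dw_perc = dh * x
        w_act -= lr * dw_act
        w_perc -= lr * dw_perc
    return w_perc * w_act      # combined mapping approaches target / x

print(round(train(), 3))
```

The upstream module improves purely because error signals flow through the module after it; this end‑to‑end differentiability is what would let planning and memory components be learned jointly rather than bolted on.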
Figure: a system architecture for autonomous intelligence (from LeCun’s proposal).
Although this architecture is not yet realized, exploring its principles may guide the development of more capable GPT‑based intelligent applications.
Rare Earth Juejin Tech Community
Juejin, a tech community that helps developers grow.