Why Bigger LLMs Aren’t Smarter: Karpathy Blames Junk Training Data
Karpathy argues that the ballooning size of large language models is driven more by noisy, low‑quality training data than by the demands of intelligence itself, and he urges splitting a clean cognitive core from external memory to build smarter, more efficient AI.
Karpathy recently made a counter‑intuitive claim: the reason today’s large language models keep getting bigger is not that intelligence requires more parameters, but that training data is dirty and noisy.
In other words, model bloat is paying for junk data rather than genuine cognitive capability.
The internet content we tend to picture (Wall Street Journal articles, Wikipedia entries) doesn’t reflect the reality of pre‑training corpora. Random samples from cutting‑edge labs’ corpora often contain stock symbols, broken HTML, spam, and nonsensical text.
Studies estimate Llama 3’s effective information retention at only about 0.07 bits per training token, meaning the model keeps only a vague shadow of most of what it has seen.
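To make that figure concrete, here is a back‑of‑envelope Python sketch. The ~15 trillion training tokens (a figure Meta has cited publicly for Llama 3) and ~4 bytes of raw text per token are assumptions for illustration; only the 0.07 bits‑per‑token estimate comes from the claim above.

```python
# Back-of-envelope: what ~0.07 bits of retained information per token implies.
# Assumptions (not from the article): ~15T training tokens, ~4 bytes of raw
# text per token.
tokens = 15e12                  # approximate Llama 3 training-set size
retained_bits_per_token = 0.07  # estimate cited above
raw_bits_per_token = 4 * 8      # ~4 bytes of raw text per token

retained_gb = tokens * retained_bits_per_token / 8 / 1e9
raw_tb = tokens * raw_bits_per_token / 8 / 1e12

print(f"Retained information: ~{retained_gb:,.0f} GB")   # ~131 GB
print(f"Raw corpus size:      ~{raw_tb:,.0f} TB")        # ~60 TB
print(f"Fraction kept: ~{retained_bits_per_token / raw_bits_per_token:.2%}")  # ~0.22%
```

On these assumptions the model keeps well under 1% of the raw information it reads, which is what "a vague shadow" means in practice.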
Consequently, building trillion‑parameter models may be less about needing a "trillion‑parameter brain" and more about creating a massive compression engine that squeezes useful intelligence out of a noisy data flood.
Thus, most parameters perform "memory work" rather than genuine reasoning.
If this view holds, the next step isn’t blindly adding parameters but separating "cognition" from "memory."
Karpathy proposes a clean split: a "cognitive core" that keeps only reasoning and problem‑solving algorithms, and an external memory that fetches facts on demand instead of embedding everything in weights.
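As a rough illustration of that split, here is a minimal Python sketch. FactStore, cognitive_core, and the keyword lookup are hypothetical stand‑ins, not Karpathy’s design or any real library; a real system would pair a small reasoning model with vector‑based retrieval.

```python
# Minimal sketch of the "cognitive core + external memory" split.
# All names here are hypothetical illustrations.

class FactStore:
    """External memory: facts live here, not in model weights."""

    def __init__(self):
        self._facts: dict[str, str] = {}

    def add(self, key: str, fact: str) -> None:
        self._facts[key] = fact

    def retrieve(self, query: str) -> list[str]:
        # Toy keyword match; a real system would use vector search.
        return [fact for key, fact in self._facts.items() if key in query.lower()]


def cognitive_core(question: str, context: list[str]) -> str:
    """Stand-in for a small reasoning model: it reasons only over the
    facts fetched for it instead of memorizing them in its weights."""
    if not context:
        return "No relevant facts retrieved."
    return f"Answer drawn from {len(context)} retrieved fact(s): {context[0]}"


memory = FactStore()
memory.add("llama 3", "Llama 3 was reportedly trained on over 15 trillion tokens.")

question = "How much data went into Llama 3?"
print(cognitive_core(question, memory.retrieve(question)))
```

The design point is that correcting or updating a fact means editing the store, not retraining the core, which is the kind of efficiency the proposal is after.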
He predicts that a cognitive core trained on high‑quality data could achieve strong intelligence with roughly 1 billion parameters.
Compare that with today’s flagship models, which reportedly range from about 200 billion to 1.8 trillion parameters and in which a large portion of the weights likely just memorizes low‑quality internet noise.
The trend already supports this view: GPT‑4o, at about 200 billion parameters, surpasses the performance of the original GPT‑4, reportedly around 1.8 trillion parameters. Moreover, from 2022 to 2024, the inference cost for GPT‑3.5‑level performance dropped roughly 280×, driven mainly by smaller, cleaner, better‑architected models.
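The 280× figure is easy to sanity‑check with the per‑token prices often quoted alongside it; the specific dollar amounts below are an assumption on my part, not from the article.

```python
# Rough check of the ~280x inference-cost drop. The dollar figures are
# assumed for illustration, not taken from the article.
cost_2022 = 20.00  # ~$ per million tokens for GPT-3.5-level output, late 2022
cost_2024 = 0.07   # ~$ per million tokens, late 2024
print(f"Cost reduction: ~{cost_2022 / cost_2024:.0f}x")  # ~286x, i.e. roughly 280x
```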
This explains the industry’s shifting focus: the competitive edge will no longer be who can stack the most parameters, but who can cleanly separate cognition and memory.
Future breakthroughs will hinge on smarter system design rather than sheer scale.
Reference: MilkRoad AI on X https://x.com/MilkRoadAI/status/2045484064585728489