From Bag‑of‑Words to ChatGPT: How Large Language Models Evolved
Tracing the evolution of large language models—from early bag‑of‑words techniques, through word embeddings, RNNs, attention mechanisms, Transformers, BERT, and GPT—this article explains each breakthrough, its limitations, and how they culminated in ChatGPT’s conversational AI.
What is a Large Language Model?
The Large Language Model (LLM) is one of the most impressive achievements in AI over the past decade. It can read, understand, and even generate text, poetry, code, and stories. If you have chatted with ChatGPT, you have already used an LLM.
LLMs did not appear out of thin air; they are the result of decades of research and model evolution.
First Stage: Bag‑of‑Words (1950s‑2010s)
Bag of Words
Core idea: Treat a text as a bag containing words, ignoring order and semantics, and count word frequencies.
Workflow:
Tokenization: split a sentence into words, e.g., "I like apples" → ["I", "like", "apples"].
Build vocabulary: collect all unique words.
Count frequencies: compute how often each word appears.
Vectorize: represent the sentence as a numeric vector.
Example
Assume three sentences:
Sentence A: "I like apples."
Sentence B: "I hate apples."
Sentence C: "The apple, I ate it."
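Counting word frequencies over those three sentences turns each one into a numeric vector. Here is a minimal bag-of-words sketch in Python, using a toy whitespace tokenizer purely for illustration:

```python
# Minimal bag-of-words sketch over the three example sentences.
from collections import Counter

sentences = [
    "I like apples",        # Sentence A
    "I hate apples",        # Sentence B
    "The apple, I ate it",  # Sentence C
]

# 1. Tokenize: lowercase, drop punctuation, split on spaces (a toy tokenizer).
tokenized = [s.lower().replace(",", "").split() for s in sentences]

# 2. Build the vocabulary: every unique word, in a fixed order.
vocab = sorted({word for tokens in tokenized for word in tokens})

# 3-4. Count frequencies and turn each sentence into a vector.
def bag_of_words(tokens):
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

for sentence, tokens in zip(sentences, tokenized):
    print(sentence, "->", bag_of_words(tokens))

# "I like apples" and "apples like I" would get identical vectors:
# word order is lost, which is exactly the defect described below.
```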
Model defects
Ignores word order: cannot distinguish "I like apples" from "Apples like me".
Ignores semantics: treats "like" and "hate" as completely unrelated words.
Large vocabularies produce high-dimensional, sparse vectors that are expensive to store and compute with.
Second Stage: Word Vectors and Semantic Revolution (2013‑2017)
Word2Vec – Giving Machines a Sense of Meaning
In 2013, Google introduced Word2Vec, the first model that gave machines semantic awareness.
“You are what you often appear with.”
For example, "苹果" often appears with "吃", "甜", "水果"; "编程" appears with "代码", "算法", "计算机".
Example:
"苹果很好吃"
"香蕉很好吃"
Because the contexts are similar, the model learns: vector("apple") ≈ vector("banana").
Word2Vec also supports analogy reasoning:
vector("king") - vector("man") + vector("woman") ≈ vector("queen")
Limitation of word vectors: static meanings
Word2Vec generates static vectors; a word’s meaning does not change with context.
Examples of polysemy:
In "我去了银行取钱", "bank" means a financial institution.
In "他坐在河岸边发呆", "bank" means a riverbank.
Word2Vec assigns the same vector to both senses, leading to errors.
Third Stage: RNN, Attention and Sequence Processing (2014‑2017)
RNN – Remembering the Past
Recurrent Neural Networks (RNNs) were introduced to handle word order by retaining information from previous tokens as a sentence is read.
For instance, when processing "I hit him", the model still remembers "I" when it reaches "hit", and retains "I hit" when it reads "him", enabling a correct interpretation of the whole sentence.
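A minimal sketch of the recurrence in plain NumPy, with untrained random weights purely for illustration, just to show how the hidden state carries earlier words forward:

```python
# Sketch of a single-layer RNN: the hidden state h carries the past.
import numpy as np

rng = np.random.default_rng(0)
hidden_size, embed_size = 8, 4

# Untrained, random parameters, purely for illustration.
W_xh = rng.normal(size=(hidden_size, embed_size))   # input  -> hidden
W_hh = rng.normal(size=(hidden_size, hidden_size))  # hidden -> hidden
b_h = np.zeros(hidden_size)

def rnn_step(h_prev, x_t):
    """One time step: mix the previous memory with the current word."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Toy embeddings for the tokens of "I hit him".
embeddings = {w: rng.normal(size=embed_size) for w in ["I", "hit", "him"]}

h = np.zeros(hidden_size)
for word in ["I", "hit", "him"]:
    h = rnn_step(h, embeddings[word])  # by "him", h still reflects "I hit"
print(h)
```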
RNN drawbacks
Processes one word at a time, leading to slow training.
Long sequences cause earlier information to be forgotten.
Attention Mechanism
“Attention lets the model dynamically focus on the most relevant parts of the sentence when understanding a word.”
For example, when translating "I like llamas", the model pays special attention to "llamas" and gives little weight to unrelated words.
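A sketch of scaled dot-product attention in NumPy over toy vectors, showing how each word receives a weighted mix of every word in the sentence:

```python
# Sketch of scaled dot-product attention over a toy 3-word sentence.
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V: each row attends over all positions."""
    d = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))           # toy vectors for "I", "like", "llamas"

output, weights = attention(X, X, X)  # self-attention: Q = K = V = X
print(weights.round(2))               # row i: how much word i attends to each word
```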
Attention later became the core building block of the Transformer architecture.
Fourth Stage: Transformer and Modern LLMs (2017‑present)
Transformer – The New Language Engine
In 2017, Google published “Attention is All You Need”, introducing the Transformer model that relies solely on attention, eliminating RNNs.
“Only attention, no RNN.”
Transformers enable parallel computation, capture long‑range dependencies, and dramatically improve efficiency, ushering in the era of modern LLMs.
BERT – Deep Understanding (2018)
BERT (Bidirectional Encoder Representations from Transformers) focuses on reading comprehension.
It is trained by masking random words in a sentence and asking the model to predict them, similar to a fill‑in‑the‑blank exercise.
Pre‑training: learn general language knowledge from massive text corpora.
Fine‑tuning: adapt the model to specific tasks such as sentiment analysis, QA, or search.
This approach made BERT the dominant “understanding” model across many NLP tasks.
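As an illustrative sketch, assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint, the fill-in-the-blank behavior looks like this:

```python
# Sketch: masked-word prediction with a pre-trained BERT.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT fills the blank using context from both directions.
for prediction in unmasker("I went to the [MASK] to withdraw money."):
    print(prediction["token_str"], round(prediction["score"], 3))
# A well-trained model ranks "bank" highly here.
```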
GPT – From Reading to Writing (2018‑present)
OpenAI kept only the decoder part of the Transformer to create a generative model, GPT (Generative Pre‑trained Transformer).
“Give me the first half of a sentence, I’ll guess the next word.”
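A sketch of that next-word game, assuming the transformers library and the small public gpt2 checkpoint:

```python
# Sketch: autoregressive generation with GPT-2.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# The model repeatedly predicts a likely next token and appends it.
result = generator("Large language models are", max_new_tokens=20)
print(result[0]["generated_text"])
```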
Starting with GPT‑1 (117 M parameters), then GPT‑2 (1.5 B), and GPT‑3 (175 B), the model’s size grew exponentially, enabling it to write news, poetry, code, and more.
GPT’s ability to generate coherent text laid the foundation for ChatGPT and the development of “general language intelligence”.
Fifth Stage: ChatGPT and AI for Everyone (2022‑present)
ChatGPT, released by OpenAI at the end of 2022, brought LLMs to everyday users by combining:
Contextual memory for coherent multi-turn dialogue.
Understanding natural‑language instructions.
Fluent, logical responses.
Key Technologies
Instruction Tuning – teaching the model to follow natural‑language commands.
RLHF (Reinforcement Learning from Human Feedback) – using human ratings to guide model outputs toward desired behavior.
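A purely illustrative sketch of the kinds of data records behind these two steps; the field names here are hypothetical and do not reflect any specific dataset:

```python
# Illustrative (hypothetical) training records for the two techniques above.

# Instruction tuning: pairs of a natural-language instruction and a good response.
instruction_example = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Large language models evolved from bag-of-words counting ...",
    "response": "LLMs grew out of decades of progressively richer language models.",
}

# RLHF: human raters compare candidate answers; the preferred one trains a
# reward model, which then steers the LLM through reinforcement learning.
preference_example = {
    "prompt": "Explain attention to a beginner.",
    "chosen": "Attention lets the model focus on the words that matter most ...",
    "rejected": "Attention is a matrix.",
}
```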
Conclusion
From simple bag‑of‑words to AI that can write poetry and code, we have witnessed a remarkable leap in language intelligence. This progress expands human cognitive boundaries and invites collaboration rather than competition between humans and machines.