From Bag‑of‑Words to ChatGPT: How Large Language Models Evolved
Tracing the evolution of large language models—from early bag‑of‑words techniques, through word embeddings, RNNs, attention mechanisms, Transformers, BERT, and GPT—this article explains each breakthrough, its limitations, and how they culminated in ChatGPT’s conversational AI.
What is a Large Language Model?
The Large Language Model (LLM) is one of the most impressive achievements in AI over the past decade. It can read, understand, and even generate text, poetry, code, and stories. If you have chatted with ChatGPT, you have already used an LLM.
LLMs did not appear out of thin air; they are the result of decades of research and model evolution.
First Stage: Bag‑of‑Words (1950s‑2010s)
Bag of Words
Core idea: Treat a text as a bag containing words, ignoring order and semantics, and count word frequencies.
Workflow:
Tokenization: split a sentence into words, e.g., "I like apples" → ["I", "like", "apples"].
Build vocabulary: collect all unique words.
Count frequencies: compute how often each word appears.
Vectorize: represent the sentence as a numeric vector.
Example
Assume three sentences:
Sentence A: "I like apples."
Sentence B: "I hate apples."
Sentence C: "The apple, I ate it."
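Counting word frequencies over those three sentences turns each one into a numeric vector. Here is a minimal bag-of-words sketch in Python, using a toy whitespace tokenizer purely for illustration:

```python
# Minimal bag-of-words sketch over the three example sentences.
from collections import Counter

sentences = [
    "I like apples",        # Sentence A
    "I hate apples",        # Sentence B
    "The apple, I ate it",  # Sentence C
]

# 1. Tokenize: lowercase, drop punctuation, split on spaces (a toy tokenizer).
tokenized = [s.lower().replace(",", "").split() for s in sentences]

# 2. Build the vocabulary: every unique word, in a fixed order.
vocab = sorted({word for tokens in tokenized for word in tokens})

# 3-4. Count frequencies and turn each sentence into a vector.
def bag_of_words(tokens):
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

for sentence, tokens in zip(sentences, tokenized):
    print(sentence, "->", bag_of_words(tokens))

# "I like apples" and "apples like I" would get identical vectors:
# word order is lost, which is exactly the defect described below.
```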
Model defects
Ignores word order: cannot distinguish "I like apples" from "Apples like me".
Ignores semantics: treats "like" and "hate" as completely unrelated words.
Large vocabularies produce high-dimensional, sparse vectors that are expensive to store and compute with.
Second Stage: Word Vectors and Semantic Revolution (2013‑2017)
Word2Vec – Giving Machines a Sense of Meaning
In 2013, Google introduced Word2Vec, the first model that gave machines semantic awareness.
“You are what you often appear with.”
For example, "苹果" often appears with "吃", "甜", "水果"; "编程" appears with "代码", "算法", "计算机".
Example:
"苹果很好吃"
"香蕉很好吃"
Because the contexts are similar, the model learns: vector("apple") ≈ vector("banana").
Word2Vec also supports analogy reasoning:
vector("king") - vector("man") + vector("woman") ≈ vector("queen")
Limitation of word vectors: static meanings
Word2Vec generates static vectors; a word’s meaning does not change with context.
Examples of polysemy:
In "我去了银行取钱", "bank" means a financial institution.
In "他坐在河岸边发呆", "bank" means a riverbank.
Word2Vec assigns the same vector to both senses, leading to errors.
Third Stage: RNN, Attention and Sequence Processing (2014‑2017)
RNN – Remembering the Past
Recurrent Neural Networks (RNNs) were introduced to handle word order by retaining information from previous tokens as a sentence is read.
For instance, when processing "I hit him", the model still remembers "I" when it reaches "hit", and retains "I hit" when it reads "him", enabling a correct interpretation of the whole sentence.
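A minimal sketch of the recurrence in plain NumPy, with untrained random weights purely for illustration, just to show how the hidden state carries earlier words forward:

```python
# Sketch of a single-layer RNN: the hidden state h carries the past.
import numpy as np

rng = np.random.default_rng(0)
hidden_size, embed_size = 8, 4

# Untrained, random parameters, purely for illustration.
W_xh = rng.normal(size=(hidden_size, embed_size))   # input  -> hidden
W_hh = rng.normal(size=(hidden_size, hidden_size))  # hidden -> hidden
b_h = np.zeros(hidden_size)

def rnn_step(h_prev, x_t):
    """One time step: mix the previous memory with the current word."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Toy embeddings for the tokens of "I hit him".
embeddings = {w: rng.normal(size=embed_size) for w in ["I", "hit", "him"]}

h = np.zeros(hidden_size)
for word in ["I", "hit", "him"]:
    h = rnn_step(h, embeddings[word])  # by "him", h still reflects "I hit"
print(h)
```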
RNN drawbacks
Processes one word at a time, leading to slow training.
Long sequences cause earlier information to be forgotten.
Attention Mechanism
“Attention lets the model dynamically focus on the most relevant parts of the sentence when understanding a word.”
For example, when translating "I like llamas", the model pays special attention to "llamas" and gives little weight to unrelated words.
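A sketch of scaled dot-product attention in NumPy over toy vectors, showing how each word receives a weighted mix of every word in the sentence:

```python
# Sketch of scaled dot-product attention over a toy 3-word sentence.
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V: each row attends over all positions."""
    d = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))           # toy vectors for "I", "like", "llamas"

output, weights = attention(X, X, X)  # self-attention: Q = K = V = X
print(weights.round(2))               # row i: how much word i attends to each word
```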
Attention later became the core building block of the Transformer architecture.
Fourth Stage: Transformer and Modern LLMs (2017‑present)
Transformer – The New Language Engine
In 2017, Google published “Attention is All You Need”, introducing the Transformer model that relies solely on attention, eliminating RNNs.
“Only attention, no RNN.”
Transformers enable parallel computation, capture long‑range dependencies, and dramatically improve efficiency, ushering in the era of modern LLMs.
BERT – Deep Understanding (2018)
BERT (Bidirectional Encoder Representations from Transformers) focuses on reading comprehension.
It is trained by masking random words in a sentence and asking the model to predict them, similar to a fill‑in‑the‑blank exercise.
Pre‑training: learn general language knowledge from massive text corpora.
Fine‑tuning: adapt the model to specific tasks such as sentiment analysis, QA, or search.
This approach made BERT the dominant “understanding” model across many NLP tasks.
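As an illustrative sketch, assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint, the fill-in-the-blank behavior looks like this:

```python
# Sketch: masked-word prediction with a pre-trained BERT.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT fills the blank using context from both directions.
for prediction in unmasker("I went to the [MASK] to withdraw money."):
    print(prediction["token_str"], round(prediction["score"], 3))
# A well-trained model ranks "bank" highly here.
```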
GPT – From Reading to Writing (2018‑present)
OpenAI kept only the decoder part of the Transformer to create a generative model, GPT (Generative Pre‑trained Transformer).
“Give me the first half of a sentence, I’ll guess the next word.”
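A sketch of that next-word game, assuming the transformers library and the small public gpt2 checkpoint:

```python
# Sketch: autoregressive generation with GPT-2.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# The model repeatedly predicts a likely next token and appends it.
result = generator("Large language models are", max_new_tokens=20)
print(result[0]["generated_text"])
```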
Starting with GPT‑1 (117 M parameters), then GPT‑2 (1.5 B), and GPT‑3 (175 B), the model’s size grew exponentially, enabling it to write news, poetry, code, and more.
GPT’s ability to generate coherent text laid the foundation for ChatGPT and the development of “general language intelligence”.
Fifth Stage: ChatGPT and AI for Everyone (2022‑present)
ChatGPT, released by OpenAI at the end of 2022, brought LLMs to everyday users by combining:
Contextual memory for coherent multi-turn dialogue.
Understanding natural‑language instructions.
Fluent, logical responses.
Key Technologies
Instruction Tuning – teaching the model to follow natural‑language commands.
RLHF (Reinforcement Learning from Human Feedback) – using human ratings to guide model outputs toward desired behavior.
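A purely illustrative sketch of the kinds of data records behind these two steps; the field names here are hypothetical and do not reflect any specific dataset:

```python
# Illustrative (hypothetical) training records for the two techniques above.

# Instruction tuning: pairs of a natural-language instruction and a good response.
instruction_example = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Large language models evolved from bag-of-words counting ...",
    "response": "LLMs grew out of decades of progressively richer language models.",
}

# RLHF: human raters compare candidate answers; the preferred one trains a
# reward model, which then steers the LLM through reinforcement learning.
preference_example = {
    "prompt": "Explain attention to a beginner.",
    "chosen": "Attention lets the model focus on the words that matter most ...",
    "rejected": "Attention is a matrix.",
}
```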
Conclusion
From simple bag‑of‑words to AI that can write poetry and code, we have witnessed a remarkable leap in language intelligence. This progress expands human cognitive boundaries and invites collaboration rather than competition between humans and machines.