From Bayesian Models to Generative Pre‑trained Transformers (GPT): A Brief History of Generative Learning

The article traces generative learning from its probabilistic roots in Bayesian classification, through Gaussian mixture models, hidden Markov models, N‑gram and neural language models, to attention mechanisms, Transformers and GPT, highlighting how each innovation expanded the ability to model data‑generating processes.

DeepHub IMBA

Jun 18, 2026

Generative Learning vs. Discriminative Learning

Generative learning models the probability distribution that could have produced the observed data, answering "what process generated the data?" In contrast, discriminative learning models the conditional probability of a label given an input, answering "what is the label for this input?" Formally, discriminative learning estimates P(y|x) while generative learning estimates the joint distribution P(x, y) = P(x|y)·P(y).

1. Naïve Bayes: The Simplest Generative Classifier

Naïve Bayes, popularized in the 1990s, applies Bayes' theorem to infer a hidden cause, such as the weather (W), from observable clues such as an umbrella (U) and wet shoes (S). The model assumes the clues are conditionally independent given the weather:

Full joint distribution:
P(W, U, S) = P(W) · P(U, S | W)

Likelihood of the clues:
P(U, S | W) = P(U | W) · P(S | W)  (naïve Bayes independence assumption)

Posterior probability:
P(W | U, S) = (P(W) · P(U | W) · P(S | W)) / P(U, S)

The prior P(W) represents the weather probability, while the likelihood terms capture how likely the clues are under each weather condition.
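The posterior formula above can be sketched in a few lines of Python. The probability values here are made up for illustration and are not from the article; only the structure of the computation follows the equations.

```python
# Hypothetical probabilities, chosen only to illustrate the computation.
prior = {"rain": 0.3, "sun": 0.7}          # P(W)
p_umbrella = {"rain": 0.9, "sun": 0.2}     # P(U = yes | W)
p_wet_shoes = {"rain": 0.8, "sun": 0.1}    # P(S = yes | W)

def posterior(prior, p_u, p_s):
    """Return P(W | U = yes, S = yes) under the naive Bayes assumption."""
    # Unnormalized posterior: P(W) * P(U | W) * P(S | W)
    joint = {w: prior[w] * p_u[w] * p_s[w] for w in prior}
    evidence = sum(joint.values())         # P(U, S), the normalizer
    return {w: joint[w] / evidence for w in joint}

post = posterior(prior, p_umbrella, p_wet_shoes)
print(post)
```

With these assumed numbers, seeing both an umbrella and wet shoes moves the probability of rain from the prior of 0.3 to roughly 0.94.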

2. Gaussian Mixture Models: Introducing Hidden Causes

Gaussian Mixture Models (GMMs) extend the idea by assuming each observation originates from one of several hidden components. The density is expressed as:

P(x) = Σₖ πₖ N(x | μₖ, Σₖ)

k  = index of the hidden component (k = 1, …, K, where K is the number of components)
x  = observation
μₖ = mean of component k
Σₖ = covariance of component k
πₖ = weight of component k

Because the component that generated each observation is unobserved, the parameters cannot be estimated directly from labeled counts; the Expectation‑Maximization (EM) algorithm is used instead.

Expectation‑Maximization (EM) Algorithm

E‑step: Compute the expected value of hidden variables (the probability that each component generated each observation) given current parameters.

M‑step: Maximize the expected log‑likelihood to update the parameters.

These steps repeat until the log‑likelihood improvement becomes negligible.
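The E‑step and M‑step can be sketched for a one‑dimensional mixture as follows. This is a minimal illustration, not a production implementation: the function name `em_gmm_1d`, the quantile‑based initialization, the variance floor, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian N(x | mu, var)."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def em_gmm_1d(x, n_components=2, n_iter=200, tol=1e-8):
    """Fit a 1-D Gaussian mixture with EM; returns (weights, means, variances)."""
    # Simple deterministic initialization (an assumption of this sketch).
    pi = np.full(n_components, 1.0 / n_components)
    mu = np.quantile(x, np.linspace(0.1, 0.9, n_components))
    var = np.full(n_components, np.var(x))
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibility r[i, k] = P(component k | x_i, current params)
        weighted = pi * normal_pdf(x[:, None], mu, var)   # shape (N, K)
        total = weighted.sum(axis=1, keepdims=True)
        r = weighted / total
        # Log-likelihood under the parameters used in this E-step
        ll = np.log(total).sum()
        # M-step: re-estimate parameters from responsibility-weighted data
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        # Stop when the log-likelihood improvement becomes negligible
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, mu, var

# Synthetic data from two Gaussians (hypothetical example data).
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 300)])
weights, means, variances = em_gmm_1d(data)
print(weights, means, variances)
```

On well‑separated data like this, the recovered means should land close to the true centers of −2 and 3, although EM in general only finds a local optimum and depends on initialization.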

3. Hidden Markov Models and Linear Dynamical Systems

HMMs model sequences where a hidden state evolves over time and emits observations. Assuming the Markov property and emission independence, the joint probability of a weather‑state sequence Weather₁:T and observations Clues₁:T is:

P(Weather₁:T, Clues₁:T) = P(Weather₁)·P(Clues₁|Weather₁)·∏ₜ₌₂ᵀ P(Weatherₜ|Weatherₜ₋₁)·P(Cluesₜ|Weatherₜ)

The Baum‑Welch algorithm (a special case of EM) estimates the transition and emission probabilities.
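The joint probability above can be evaluated directly for one state path, as in the sketch below. The two‑state weather model and all parameter values are hypothetical. The second function uses the forward recursion, a standard technique not covered in the text above, to sum that joint over every possible hidden path and obtain the likelihood of the clues alone.

```python
import numpy as np

states = ["rain", "sun"]                  # hidden weather states
clues = ["umbrella", "no_umbrella"]       # observable clues

# Hypothetical parameters for illustration only.
initial = np.array([0.5, 0.5])            # P(Weather_1)
transition = np.array([[0.7, 0.3],        # P(Weather_t | Weather_{t-1} = rain)
                       [0.2, 0.8]])       # P(Weather_t | Weather_{t-1} = sun)
emission = np.array([[0.9, 0.1],          # P(Clues_t | Weather_t = rain)
                     [0.2, 0.8]])         # P(Clues_t | Weather_t = sun)

def joint_probability(weather_idx, clue_idx):
    """P(Weather_1:T, Clues_1:T) for one fully specified path."""
    p = initial[weather_idx[0]] * emission[weather_idx[0], clue_idx[0]]
    for t in range(1, len(weather_idx)):
        p *= transition[weather_idx[t - 1], weather_idx[t]]
        p *= emission[weather_idx[t], clue_idx[t]]
    return p

def forward_likelihood(clue_idx):
    """P(Clues_1:T), summing the joint over all hidden paths."""
    alpha = initial * emission[:, clue_idx[0]]
    for obs in clue_idx[1:]:
        alpha = (alpha @ transition) * emission[:, obs]
    return alpha.sum()

observed = [0, 0, 1]                      # umbrella, umbrella, no umbrella
print(joint_probability([0, 0, 1], observed))   # rain, rain, sun
print(forward_likelihood(observed))
```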

4. N‑gram Language Models: Early Generative Text Modeling

Markov (1913) and Shannon (1948) laid the foundation for statistical language modeling. An N‑gram model predicts the next word given the previous n‑1 words:

P(wₜ | wₜ₋ₙ₊₁, …, wₜ₋₁)

Bi‑gram models, P(wₜ | wₜ₋₁), and tri‑gram models, P(wₜ | wₜ₋₂, wₜ₋₁), are common; larger values of n suffer from data sparsity.
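A bi‑gram model can be estimated from raw counts, as in this minimal sketch. The toy corpus and the function name `train_bigram` are assumptions made for the example, and no smoothing is applied.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Estimate P(w_t | w_{t-1}) by maximum likelihood from pair counts."""
    pair_counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        pair_counts[prev][nxt] += 1
    return {
        prev: {w: c / sum(counter.values()) for w, c in counter.items()}
        for prev, counter in pair_counts.items()
    }

# Hypothetical toy corpus.
corpus = "the cat sat on the mat the dog sat on the mat".split()
model = train_bigram(corpus)
print(model["the"])   # distribution over words that follow "the"
print(model["sat"])   # distribution over words that follow "sat"
```

Any word pair that never appears in the corpus gets no probability at all, which is the data‑sparsity problem in miniature: it worsens quickly as n grows, because the number of possible contexts grows exponentially.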

5. Neural Language Models

Neural networks replace exact count statistics with learned continuous word embeddings. A simple feed‑forward network that predicts the next word from three context words works as follows:

Assume the vocabulary contains 6 words: {the, cat, sat, on, mat, dog}
Input context vector: x = [0.2, 0.8, 0.9, 0.1, 0.4, 0.6]

h = σ(W₁x + b₁)      (hidden layer)

scores = W₂h + b₂   (output layer)

P(wₜ|input) = softmax(scores)

The loss is cross‑entropy L = - Σᵢ yᵢ log(pᵢ), and gradients ∂L/∂W₁, ∂L/∂b₁, ∂L/∂W₂, ∂L/∂b₂ are computed by back‑propagation.
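The forward pass and loss can be sketched in NumPy as below. Several details are assumptions of this example rather than statements from the article: the input is read as three 2‑dimensional word embeddings concatenated, σ is taken to be the logistic sigmoid, the hidden size is 4, the weights are random, and "mat" is picked as the target word.

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat", "dog"]
x = np.array([0.2, 0.8, 0.9, 0.1, 0.4, 0.6])    # assumed: three 2-d embeddings

rng = np.random.default_rng(0)
hidden_size = 4                                  # hypothetical choice
W1 = rng.normal(scale=0.5, size=(hidden_size, x.size))
b1 = np.zeros(hidden_size)
W2 = rng.normal(scale=0.5, size=(len(vocab), hidden_size))
b2 = np.zeros(len(vocab))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())      # subtract the max for numerical stability
    return e / e.sum()

h = sigmoid(W1 @ x + b1)         # hidden layer
scores = W2 @ h + b2             # output layer
p = softmax(scores)              # P(w_t | input)

target = vocab.index("mat")      # hypothetical correct next word
y = np.zeros(len(vocab))
y[target] = 1.0                  # one-hot target
loss = -np.sum(y * np.log(p))    # cross-entropy

# For softmax followed by cross-entropy, dL/dscores simplifies to p - y.
d_scores = p - y
dW2 = np.outer(d_scores, h)      # dL/dW2
db2 = d_scores                   # dL/db2
print(loss)
```

The remaining gradients, ∂L/∂W₁ and ∂L/∂b₁, follow by applying the chain rule one layer further back through the sigmoid.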

6. Recurrent Neural Networks (RNN)

RNNs process sequences by maintaining a hidden state that evolves with each time step:

hₜ = f(Wₕhₜ₋₁ + Wₓxₜ + b)
yₜ = g(Wₒhₜ + bₒ)

Training uses back‑propagation through time. Variants such as LSTM and GRU add gates to capture longer‑range dependencies.

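Unlike HMMs, which model hidden states explicitly through transition and emission probabilities, an RNN's hidden state is a continuous vector whose dynamics are learned implicitly during training.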
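The recurrence can be sketched as a short forward pass. This example assumes f = tanh and g = softmax, starts from a zero hidden state, and uses random weights with hypothetical dimensions; none of these choices come from the article.

```python
import numpy as np

def rnn_forward(xs, Wh, Wx, b, Wo, bo):
    """Run a vanilla RNN over a sequence; returns hidden states and outputs."""
    h = np.zeros(Wh.shape[0])              # h_0, assumed to start at zero
    hs, ys = [], []
    for x in xs:
        h = np.tanh(Wh @ h + Wx @ x + b)   # h_t = f(W_h h_{t-1} + W_x x_t + b)
        scores = Wo @ h + bo
        e = np.exp(scores - scores.max())
        ys.append(e / e.sum())             # y_t = g(W_o h_t + b_o), g = softmax
        hs.append(h)
    return np.array(hs), np.array(ys)

rng = np.random.default_rng(0)
input_size, hidden_size, output_size, T = 3, 5, 4, 6   # hypothetical sizes
Wh = rng.normal(scale=0.3, size=(hidden_size, hidden_size))
Wx = rng.normal(scale=0.3, size=(hidden_size, input_size))
b = np.zeros(hidden_size)
Wo = rng.normal(scale=0.3, size=(output_size, hidden_size))
bo = np.zeros(output_size)

xs = rng.normal(size=(T, input_size))
hs, ys = rnn_forward(xs, Wh, Wx, b, Wo, bo)
print(hs.shape, ys.shape)
```

The same weight matrices are reused at every time step, which is why the gradient has to be accumulated across steps by back‑propagation through time.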

7. Attention Mechanism

Bahdanau et al. (2014) introduced attention to alleviate the encoder‑decoder bottleneck in sequence‑to‑sequence models. For each decoder step t, attention scores are computed:

sₜᵢ = a(dₜ, eᵢ)
a(dₜ, eᵢ) = vᵀ tanh(W_d dₜ + W_e eᵢ)
αₜᵢ = exp(sₜᵢ) / Σⱼ exp(sₜⱼ)
cₜ = Σᵢ αₜᵢ eᵢ

The context vector cₜ is a weighted sum of encoder states, allowing the decoder to attend to all input positions.
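The four equations above map onto a few lines of NumPy. The dimensions and random parameters in this sketch are hypothetical, and the function name `additive_attention` is chosen for the example.

```python
import numpy as np

def additive_attention(d_t, E, W_d, W_e, v):
    """Additive attention for one decoder state d_t over encoder states E."""
    # s_ti = v^T tanh(W_d d_t + W_e e_i), computed for all i at once
    scores = np.tanh(d_t @ W_d.T + E @ W_e.T) @ v   # shape (n_inputs,)
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()               # alpha_ti, softmax over i
    context = weights @ E                           # c_t = sum_i alpha_ti e_i
    return weights, context

rng = np.random.default_rng(0)
n_inputs, enc_dim, dec_dim, attn_dim = 5, 4, 3, 6   # hypothetical sizes
E = rng.normal(size=(n_inputs, enc_dim))            # encoder states e_1..e_n
d_t = rng.normal(size=dec_dim)                      # current decoder state
W_d = rng.normal(size=(attn_dim, dec_dim))
W_e = rng.normal(size=(attn_dim, enc_dim))
v = rng.normal(size=attn_dim)

alpha, c_t = additive_attention(d_t, E, W_d, W_e, v)
print(alpha, c_t)
```

Because the weights are non‑negative and sum to one, the context vector is a convex combination of the encoder states.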

8. Transformer

Transformers replace recurrence with pure attention, enabling parallel computation. A self‑attention block computes:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

Multi‑head attention applies several such blocks in parallel. Positional encodings inject order information. Causal masking prevents attending to future tokens during training.
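A single attention head with a causal mask can be sketched as follows. The projection matrices are random, the sizes are hypothetical, and multi‑head splitting and positional encodings are left out to keep the example short.

```python
import numpy as np

def causal_self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention with a causal mask."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # shape (T, T)
    # Mask future positions: token t may only attend to positions <= t
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d_model, d_k = 4, 8, 8                                # hypothetical sizes
X = rng.normal(size=(T, d_model))                        # one row per token
Wq, Wk, Wv = (rng.normal(scale=0.3, size=(d_model, d_k)) for _ in range(3))

out, weights = causal_self_attention(X, Wq, Wk, Wv)
print(np.round(weights, 2))   # lower-triangular: no weight on future tokens
```

All positions are computed in one matrix product rather than step by step, which is the source of the parallelism, while the mask keeps each output independent of later tokens.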

9. Generative Pre‑trained Transformer (GPT)

OpenAI’s 2018 GPT paper applied a decoder‑only Transformer with causal masking to the same next‑token objective used by earlier language models:

P(x₁,…,xₙ) = ∏ₜ P(xₜ | x₁,…,xₜ₋₁)

After pre‑training on massive text corpora, the model can be fine‑tuned for downstream tasks. GPT‑2 (2019) demonstrated zero‑shot capabilities on summarization, translation, and QA, suggesting that scaling the core Transformer architecture yields increasingly general language capabilities.
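The chain‑rule factorization leads to two operations: scoring a sequence and generating one token at a time. The sketch below illustrates both with a toy stand‑in for the model; `next_token_probs` is a hypothetical hand‑written rule, not a Transformer, and the five‑token vocabulary is invented for the example.

```python
import numpy as np

vocab = ["<bos>", "the", "cat", "sat", "<eos>"]

def next_token_probs(prefix):
    """Toy stand-in for a trained model: returns P(x_t | x_1..x_{t-1}).

    Hypothetical rule: strongly prefer the token that follows the last one
    in vocabulary order. A real GPT would compute this with a Transformer.
    """
    logits = np.zeros(len(vocab))
    logits[min(prefix[-1] + 1, len(vocab) - 1)] = 3.0
    e = np.exp(logits - logits.max())
    return e / e.sum()

def sequence_log_prob(tokens):
    """Sum of log P(x_t | x_<t) for t >= 1, conditioning on tokens[0]."""
    return sum(np.log(next_token_probs(tokens[:t])[tokens[t]])
               for t in range(1, len(tokens)))

def generate(max_len=10, seed=0):
    """Sample tokens one at a time, feeding each sample back as context."""
    rng = np.random.default_rng(seed)
    tokens = [0]                                   # start from <bos>
    while len(tokens) < max_len and tokens[-1] != len(vocab) - 1:
        probs = next_token_probs(tokens)
        tokens.append(int(rng.choice(len(vocab), p=probs)))
    return tokens

sample = generate()
print([vocab[i] for i in sample], sequence_log_prob(sample))
```

Pre‑training amounts to adjusting the model's parameters so that this log‑probability is as high as possible on real text.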

Conclusion

The narrative shows that generative learning is not merely about machines creating content; it is about learning the hidden causes that could generate observed data. From the simple Bayesian classifier, through mixture models, HMMs, N‑grams, neural networks, attention, and finally large‑scale Transformers, the core objective remains modeling the data‑generating distribution.

Other generative families such as Variational Auto‑encoders, GANs, and diffusion models follow the same principle of learning a distribution from which new samples can be drawn.

by Sanchayan Sarkar
