
Topic Modeling Explained: pLSA, LDA, and How to Pick the Right Number of Topics

This article introduces the fundamentals of topic modeling, compares the probabilistic latent semantic analysis (pLSA) and latent Dirichlet allocation (LDA) methods, explains their graphical models and inference via EM or Gibbs sampling, and discusses practical strategies for selecting the optimal number of topics using perplexity or hierarchical Dirichlet processes.


Introduction

Bag‑of‑words and N‑gram representations cannot capture the fact that different words may express the same topic. Topic models address this by mapping words with a shared theme onto the same dimension: each document is represented as a K‑dimensional topic vector whose k‑th entry is the probability that the document belongs to topic k.
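To make the dimensionality difference concrete, here is a minimal sketch (the vocabulary size, word indices, and topic probabilities are hypothetical, chosen only for illustration):

```python
import numpy as np

# Bag-of-words: one dimension per vocabulary word (here, a 10,000-word vocabulary)
bow = np.zeros(10_000)
bow[[17, 42, 901]] = [2, 1, 3]   # sparse raw word counts for one document

# Topic-model representation: one dimension per topic (here, K = 5 topics)
theta = np.array([0.6, 0.25, 0.05, 0.05, 0.05])   # p(z|d), a distribution over topics
assert np.isclose(theta.sum(), 1.0)               # valid probability vector
```

The 10,000-dimensional sparse count vector collapses into a dense 5-dimensional distribution over topics, which is what makes topic vectors useful for comparing documents semantically.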

Problem Statement

What are the common topic models and their principles?

How to determine the number of topics in an LDA model?

Answer and Analysis

1. Common Topic Models

pLSA (probabilistic Latent Semantic Analysis)

pLSA assumes K topics and models the generation of each word w in a document d by first selecting a topic z and then generating w from the topic. The probability of a word given a document is:

p(w|d) = Σ_z p(z|d) p(w|z)
[Figure: pLSA graphical model]

The likelihood of the whole corpus is maximized using the EM algorithm because the topic assignments are latent variables.
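The EM updates can be sketched directly from the mixture formula above. This is a minimal numpy implementation, assuming the corpus is given as a document–term count matrix; the E-step computes the responsibilities p(z|d,w) ∝ p(z|d)·p(w|z), and the M-step re-estimates both distributions from expected counts:

```python
import numpy as np

def plsa_em(counts, K, n_iters=50, seed=0):
    """EM for pLSA on a D x V document-term count matrix.

    Returns p(z|d) of shape (D, K) and p(w|z) of shape (K, V).
    """
    rng = np.random.default_rng(seed)
    D, V = counts.shape
    p_z_d = rng.dirichlet(np.ones(K), size=D)   # p(z|d), rows sum to 1
    p_w_z = rng.dirichlet(np.ones(V), size=K)   # p(w|z), rows sum to 1
    for _ in range(n_iters):
        # E-step: responsibility p(z|d,w) ∝ p(z|d) * p(w|z), shape (D, V, K)
        resp = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        resp /= resp.sum(axis=2, keepdims=True) + 1e-12
        # M-step: re-estimate from expected counts n(d,w) * p(z|d,w)
        weighted = counts[:, :, None] * resp
        p_w_z = weighted.sum(axis=0).T                    # accumulate over documents
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = weighted.sum(axis=1)                      # accumulate over words
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_z_d, p_w_z
```

Each iteration provably does not decrease the corpus log-likelihood, which is why EM is the standard fitting procedure when the topic assignments are latent.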

LDA (Latent Dirichlet Allocation)

LDA extends pLSA by placing Dirichlet priors α and β on the per‑document topic distribution θ and the per‑topic word distribution φ. This Bayesian treatment makes the parameters random variables and allows posterior inference via Gibbs sampling.

[Figure: LDA graphical model]

Gibbs sampling iteratively resamples the topic assignment of each word from its full conditional distribution given all other assignments; once the chain has mixed, θ and φ are estimated from the accumulated topic counts.
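A compact sketch of collapsed Gibbs sampling for LDA follows. It assumes the corpus is a list of word-id lists and uses the standard full conditional p(z=k) ∝ (n_dk + α)(n_kw + β)/(n_k + Vβ); hyperparameter values and iteration counts are illustrative:

```python
import numpy as np

def lda_gibbs(docs, K, V, alpha=0.1, beta=0.01, n_iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA. docs: list of lists of word ids in [0, V)."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, K))    # document-topic counts
    nkw = np.zeros((K, V))    # topic-word counts
    nk = np.zeros(K)          # total words assigned to each topic
    z = []                    # current topic assignment of every token
    for d, doc in enumerate(docs):
        zd = rng.integers(K, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # remove this token's current assignment from the counts
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                # full conditional: p(z=k) ∝ (n_dk + α)(n_kw + β)/(n_k + Vβ)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                t = rng.choice(K, p=p / p.sum())
                z[d][i] = t
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    # point estimates of θ (doc-topic) and φ (topic-word) from the final counts
    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    phi = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)
    return theta, phi
```

The Dirichlet priors α and β appear directly as smoothing terms in the conditional, which is the practical payoff of the Bayesian treatment: no topic or word ever gets exactly zero probability.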

2. Determining the Number of Topics in LDA

The number of topics K is a hyper‑parameter. A common practice is to split the corpus into training, validation, and test sets (e.g., 60%/20%/20%). Models with different K are trained on the training set, and their perplexity is evaluated on the validation set:

perplexity(D) = exp( − Σ_d Σ_n log p(w_dn | model) / Σ_d N_d )

Perplexity typically decreases as K grows, then rises on the validation set due to over‑fitting. The K at the perplexity minimum or at the elbow point is chosen. Alternatively, non‑parametric models such as HDP‑LDA automatically infer the effective number of topics.
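The selection procedure reduces to computing perplexity from per-token held-out log-likelihoods and taking the K at the minimum. A minimal sketch (the candidate K values and log-likelihoods below are hypothetical, standing in for numbers a trained model would produce):

```python
import math

def perplexity(log_probs):
    """perplexity = exp(- average per-word log-likelihood).

    log_probs: log p(w_dn | model) for every token in the validation set.
    """
    return math.exp(-sum(log_probs) / len(log_probs))

# Hypothetical per-token validation log-likelihoods for three candidate K values
candidates = {10: [-6.1, -5.9, -6.0], 20: [-5.4, -5.5, -5.3], 40: [-5.6, -5.8, -5.7]}
scores = {K: perplexity(lp) for K, lp in candidates.items()}
best_K = min(scores, key=scores.get)   # K with the lowest validation perplexity
```

In this toy example the mid-sized model wins: perplexity falls from K=10 to K=20, then rises again at K=40, which is exactly the over-fitting elbow described above.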

References

Hofmann, T. (1999). Probabilistic latent semantic analysis. UAI.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. JMLR.

Teh, Y. W., et al. (2005). Hierarchical Dirichlet processes. NIPS.

George, E. I., & McCulloch, R. E. (1993). Variable selection via Gibbs sampling. JASA.

Tags: Machine Learning, topic modeling, LDA, perplexity, pLSA
Written by Hulu Beijing

Follow Hulu's official WeChat account for the latest company updates and recruitment information.