Topic Modeling Explained: pLSA, LDA, and How to Pick the Right Number of Topics
This article introduces the fundamentals of topic modeling, compares the probabilistic latent semantic analysis (pLSA) and latent Dirichlet allocation (LDA) methods, explains their graphical models and inference via EM or Gibbs sampling, and discusses practical strategies for selecting the optimal number of topics using perplexity or hierarchical Dirichlet processes.
Introduction
Bag-of-words and N-gram representations cannot capture that different words may express the same topic. Topic models map words sharing a theme to the same dimension, representing each document as a K-dimensional topic vector in which each dimension is the probability that the document belongs to a particular topic.
Problem Statement
What are the common topic models and their principles?
How to determine the number of topics in an LDA model?
Answer and Analysis
1. Common Topic Models
pLSA (probabilistic Latent Semantic Analysis)
pLSA assumes K topics and models the generation of each word w in a document d by first selecting a topic z and then generating w from the topic. The probability of a word given a document is:
p(w|d) = Σ_z p(z|d) p(w|z)

The likelihood of the whole corpus is maximized with the EM algorithm, because the topic assignments z are latent variables.
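The EM updates for pLSA can be sketched in NumPy as follows. This is a minimal illustration on a document-word count matrix, not an optimized implementation; the function name and the random Dirichlet initialization are choices made here for the sketch, not prescribed by the original pLSA paper.

```python
import numpy as np

def plsa_em(counts, K, n_iter=50, seed=0):
    """EM for pLSA on a D x V document-word count matrix.

    Returns p(z|d) with shape (D, K) and p(w|z) with shape (K, V).
    """
    rng = np.random.default_rng(seed)
    D, V = counts.shape
    p_z_d = rng.dirichlet(np.ones(K), size=D)   # p(z|d), one row per document
    p_w_z = rng.dirichlet(np.ones(V), size=K)   # p(w|z), one row per topic
    for _ in range(n_iter):
        # E-step: responsibilities p(z|d,w) ∝ p(z|d) p(w|z), shape (D, V, K)
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        resp = joint / joint.sum(axis=2, keepdims=True)
        # M-step: re-estimate both distributions from expected counts
        weighted = counts[:, :, None] * resp    # n(d,w) * p(z|d,w)
        p_w_z = weighted.sum(axis=0).T          # sum over documents -> (K, V)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = weighted.sum(axis=1)            # sum over words -> (D, K)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_z_d, p_w_z
```

Each iteration alternates between computing the posterior over topic assignments (E-step) and re-normalizing the expected counts into fresh parameter estimates (M-step), exactly the latent-variable structure described above.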
LDA (Latent Dirichlet Allocation)
LDA extends pLSA by placing Dirichlet priors α and β on the per‑document topic distribution θ and the per‑topic word distribution φ. This Bayesian treatment makes the parameters random variables and allows posterior inference via Gibbs sampling.
Gibbs sampling iteratively resamples the topic assignment of each word from its full conditional distribution given all other assignments; after the chain converges, θ and φ are estimated from the resulting topic-assignment counts.
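The resampling loop can be sketched as a collapsed Gibbs sampler. This is a simplified illustration assuming symmetric hyper-parameters α and β; the function name and default values are choices made for this sketch, and a production sampler would add burn-in and convergence checks.

```python
import numpy as np

def lda_gibbs(docs, K, V, alpha=0.1, beta=0.01, n_iter=100, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs: list of documents, each a list of word ids in [0, V).
    Returns estimates of theta (D x K) and phi (K x V).
    """
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, K))   # per-document topic counts
    nkw = np.zeros((K, V))   # per-topic word counts
    nk = np.zeros(K)         # total words assigned to each topic
    z = [rng.integers(K, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):            # initialize counts randomly
        for n, w in enumerate(doc):
            k = z[d][n]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]                   # remove current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # full conditional: p(z=k | rest) ∝ (n_dk + α)(n_kw + β)/(n_k + Vβ)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][n] = k                   # record new assignment
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    phi = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)
    return theta, phi
```

The Dirichlet priors appear only as the smoothing terms α and β in the full conditional, which is what "collapsing" θ and φ out of the model buys: only integer counts need to be tracked during sampling.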
2. Determining the Number of Topics in LDA
The number of topics K is a hyper-parameter. A common practice is to split the corpus into training, validation, and test sets (e.g., 60%/20%/20%). Models with different K are trained on the training set, and their perplexity is evaluated on the validation set:
perplexity(D) = exp( -(1 / Σ_d N_d) Σ_d Σ_n log p(w_dn | model) )

On the training set, perplexity typically keeps decreasing as K grows; on the validation set it first falls and then rises once K is large enough to over-fit. The K at the perplexity minimum (or at the elbow of the curve) is chosen. Alternatively, non-parametric models such as HDP-LDA infer the effective number of topics automatically.
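The formula above can be sketched directly, given fitted θ and φ and a set of tokenized documents. One caveat: properly evaluating held-out documents requires first inferring θ for them (e.g., by folding-in); this sketch simply scores documents whose θ rows are supplied, and the function name is illustrative.

```python
import numpy as np

def perplexity(theta, phi, docs):
    """exp(-(1/N) * Σ_d Σ_n log p(w_dn)), with p(w_dn) = Σ_k θ_dk φ_kw.

    theta: (D, K) document-topic distributions for the scored docs.
    phi:   (K, V) topic-word distributions.
    docs:  list of documents, each a list of word ids in [0, V).
    """
    log_lik, n_tokens = 0.0, 0
    for d, doc in enumerate(docs):
        for w in doc:
            log_lik += np.log(theta[d] @ phi[:, w])  # marginalize over topics
            n_tokens += 1
    return np.exp(-log_lik / n_tokens)
```

To select K, one would train a model per candidate K on the training split, compute this quantity on the validation split, and keep the K at the minimum or elbow of the resulting curve.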
References
Hofmann, T. (1999). Probabilistic latent semantic analysis. UAI.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. JMLR.
Teh, Y. W., et al. (2005). Hierarchical Dirichlet processes. NIPS.
George, E. I., & McCulloch, R. E. (1993). Variable selection via Gibbs sampling. JASA.
Hulu Beijing