Topic Modeling Explained: pLSA, LDA, and How to Pick the Right Number of Topics
This article introduces the fundamentals of topic modeling, compares the probabilistic latent semantic analysis (pLSA) and latent Dirichlet allocation (LDA) methods, explains their graphical models and inference via EM or Gibbs sampling, and discusses practical strategies for selecting the optimal number of topics using perplexity or hierarchical Dirichlet processes.
Introduction
Bag-of-words and N-gram representations cannot capture that different words may express the same topic. Topic models map words sharing a theme to the same dimension, representing each document as a K-dimensional topic vector in which each dimension is the probability that the document belongs to a particular topic.
Problem Statement
What are the common topic models and their principles?
How to determine the number of topics in an LDA model?
Answer and Analysis
1. Common Topic Models
pLSA (probabilistic Latent Semantic Analysis)
pLSA assumes K topics and models the generation of each word w in a document d by first selecting a topic z and then generating w from the topic. The probability of a word given a document is:
p(w|d) = Σ_z p(z|d) p(w|z)

The likelihood of the whole corpus is maximized with the EM algorithm, because the topic assignments z are latent variables.
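The EM updates for pLSA can be sketched in NumPy as follows. This is a minimal illustration on a document-word count matrix, not an optimized implementation; the function name and the random Dirichlet initialization are choices made here for the sketch, not prescribed by the original pLSA paper.

```python
import numpy as np

def plsa_em(counts, K, n_iter=50, seed=0):
    """EM for pLSA on a D x V document-word count matrix.

    Returns p(z|d) with shape (D, K) and p(w|z) with shape (K, V).
    """
    rng = np.random.default_rng(seed)
    D, V = counts.shape
    p_z_d = rng.dirichlet(np.ones(K), size=D)   # p(z|d), one row per document
    p_w_z = rng.dirichlet(np.ones(V), size=K)   # p(w|z), one row per topic
    for _ in range(n_iter):
        # E-step: responsibilities p(z|d,w) ∝ p(z|d) p(w|z), shape (D, V, K)
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        resp = joint / joint.sum(axis=2, keepdims=True)
        # M-step: re-estimate both distributions from expected counts
        weighted = counts[:, :, None] * resp    # n(d,w) * p(z|d,w)
        p_w_z = weighted.sum(axis=0).T          # sum over documents -> (K, V)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = weighted.sum(axis=1)            # sum over words -> (D, K)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_z_d, p_w_z
```

Each iteration alternates between computing the posterior over topic assignments (E-step) and re-normalizing the expected counts into fresh parameter estimates (M-step), exactly the latent-variable structure described above.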
LDA (Latent Dirichlet Allocation)
LDA extends pLSA by placing Dirichlet priors α and β on the per‑document topic distribution θ and the per‑topic word distribution φ. This Bayesian treatment makes the parameters random variables and allows posterior inference via Gibbs sampling.
Gibbs sampling iteratively resamples the topic assignment of each word from its full conditional distribution given all other assignments; after the chain converges, θ and φ are estimated from the resulting topic-assignment counts.
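The resampling loop can be sketched as a collapsed Gibbs sampler. This is a simplified illustration assuming symmetric hyper-parameters α and β; the function name and default values are choices made for this sketch, and a production sampler would add burn-in and convergence checks.

```python
import numpy as np

def lda_gibbs(docs, K, V, alpha=0.1, beta=0.01, n_iter=100, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs: list of documents, each a list of word ids in [0, V).
    Returns estimates of theta (D x K) and phi (K x V).
    """
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, K))   # per-document topic counts
    nkw = np.zeros((K, V))   # per-topic word counts
    nk = np.zeros(K)         # total words assigned to each topic
    z = [rng.integers(K, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):            # initialize counts randomly
        for n, w in enumerate(doc):
            k = z[d][n]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]                   # remove current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # full conditional: p(z=k | rest) ∝ (n_dk + α)(n_kw + β)/(n_k + Vβ)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][n] = k                   # record new assignment
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    phi = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)
    return theta, phi
```

The Dirichlet priors appear only as the smoothing terms α and β in the full conditional, which is what "collapsing" θ and φ out of the model buys: only integer counts need to be tracked during sampling.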
2. Determining the Number of Topics in LDA
The number of topics K is a hyper-parameter. A common practice is to split the corpus into training, validation, and test sets (e.g., 60%/20%/20%). Models with different K are trained on the training set, and their perplexity is evaluated on the validation set:
perplexity(D) = exp( -(1 / Σ_d N_d) Σ_d Σ_n log p(w_dn | model) )

On the training set, perplexity typically keeps decreasing as K grows; on the validation set it first falls and then rises once K is large enough to over-fit. The K at the perplexity minimum (or at the elbow of the curve) is chosen. Alternatively, non-parametric models such as HDP-LDA infer the effective number of topics automatically.
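The formula above can be sketched directly, given fitted θ and φ and a set of tokenized documents. One caveat: properly evaluating held-out documents requires first inferring θ for them (e.g., by folding-in); this sketch simply scores documents whose θ rows are supplied, and the function name is illustrative.

```python
import numpy as np

def perplexity(theta, phi, docs):
    """exp(-(1/N) * Σ_d Σ_n log p(w_dn)), with p(w_dn) = Σ_k θ_dk φ_kw.

    theta: (D, K) document-topic distributions for the scored docs.
    phi:   (K, V) topic-word distributions.
    docs:  list of documents, each a list of word ids in [0, V).
    """
    log_lik, n_tokens = 0.0, 0
    for d, doc in enumerate(docs):
        for w in doc:
            log_lik += np.log(theta[d] @ phi[:, w])  # marginalize over topics
            n_tokens += 1
    return np.exp(-log_lik / n_tokens)
```

To select K, one would train a model per candidate K on the training split, compute this quantity on the validation split, and keep the K at the minimum or elbow of the resulting curve.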
References
Hofmann, T. (1999). Probabilistic latent semantic analysis. UAI.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. JMLR.
Teh, Y. W., et al. (2005). Hierarchical Dirichlet processes. NIPS.
George, E. I., & McCulloch, R. E. (1993). Variable selection via Gibbs sampling. JASA.
Hulu Beijing