Deep Dive into Forward vs Reverse KL Divergence: When to Use Which?
The article explains the definitions, properties, and asymmetric nature of KL divergence, compares Forward KL (mean‑seeking) and Reverse KL (mode‑seeking) through bimodal examples, and provides practical guidelines for choosing between them based on sampling and probability‑evaluation capabilities in machine‑learning tasks.
Definition and Properties
KL divergence measures the difference between two probability distributions P and Q. It is inherently asymmetric, so KL(P‖Q) ≠ KL(Q‖P). The divergence is non‑negative, equals zero if and only if the distributions are identical, and KL(P‖Q) is finite only when the support of P is contained in the support of Q.
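Written out explicitly (adopting the common convention, used in the rest of this summary, that P is the target and Q is the model), the two directions differ only in which distribution the expectation is taken under:

```latex
% Forward KL: expectation under the target P
\mathrm{KL}(P \,\|\, Q) \;=\; \mathbb{E}_{x \sim P}\!\left[\log \frac{P(x)}{Q(x)}\right]

% Reverse KL: expectation under the model Q
\mathrm{KL}(Q \,\|\, P) \;=\; \mathbb{E}_{x \sim Q}\!\left[\log \frac{Q(x)}{P(x)}\right]
```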
Forward KL: Mean‑Seeking (Mass‑Covering)
Minimizing Forward KL (FKL), KL(P‖Q), is equivalent to minimizing the cross‑entropy loss, since the entropy of the target P is a constant that does not depend on the model; equivalently, it maximizes the log‑likelihood of samples drawn from the target distribution under the model distribution. This forces the model to assign probability to every high‑probability region of the target, leading to a “mean‑seeking” (mass‑covering) behavior: when the model’s expressive power is limited (e.g., fitting a unimodal Gaussian to a bimodal target), the optimum spreads its mass across both modes and places its mean between them.
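A minimal NumPy sketch of this behavior, assuming an equal‑weight mixture of N(−4, 1) and N(+4, 1) as the target (the modes, weights, and sample size are illustrative choices, not from the article). Because forward‑KL minimization over a single Gaussian is plain maximum likelihood, the optimum is just the sample mean and standard deviation of draws from the target:

```python
import numpy as np

rng = np.random.default_rng(0)

# Bimodal target: an equal-weight mixture of N(-4, 1) and N(+4, 1)
# (illustrative parameters, not from the article).
n = 100_000
component = rng.integers(0, 2, size=n)          # which mode each sample comes from
x = rng.normal(loc=np.where(component == 0, -4.0, 4.0), scale=1.0)

# Minimizing forward KL(P || Q) over a unimodal Gaussian Q is maximum likelihood,
# so the optimum is the sample mean and standard deviation of the target draws.
mu_hat, sigma_hat = x.mean(), x.std()
print(f"fitted mean  ≈ {mu_hat:.2f}")    # ≈ 0: mass placed between the two modes
print(f"fitted sigma ≈ {sigma_hat:.2f}") # ≈ 4.1: widened to cover both modes
```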
Reverse KL: Mode‑Seeking
Minimizing Reverse KL (RKL), KL(Q‖P), takes the expectation under the model distribution: it maximizes the target log‑probability of the model’s own samples (while an entropy term keeps the model from collapsing to a point). Because the expectation is over the model, the model is penalized heavily wherever it places mass that the target does not support, but it pays no cost for ignoring target regions it never samples. This yields a “mode‑seeking” behavior: a limited‑capacity model tends to collapse onto a single high‑probability mode of the target and ignore the others.
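A companion sketch of the mode‑seeking behavior on the same illustrative bimodal target. The reverse KL is estimated by Monte Carlo with fixed base noise (the reparameterization trick) so that an off‑the‑shelf optimizer can be applied; the optimizer, sample size, and starting point are arbitrary choices:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
eps = rng.normal(size=20_000)   # fixed base noise, reused at every optimization step

# Same illustrative bimodal target as above: 0.5*N(-4, 1) + 0.5*N(+4, 1).
def log_p(x):
    return np.log(0.5 * norm.pdf(x, -4.0, 1.0) + 0.5 * norm.pdf(x, 4.0, 1.0) + 1e-300)

# Monte Carlo estimate of reverse KL(Q || P) for a Gaussian model Q = N(mu, sigma^2),
# written with x = mu + sigma * eps so the objective is deterministic in (mu, sigma).
def reverse_kl(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    x = mu + sigma * eps
    log_q = norm.logpdf(x, mu, sigma)
    return np.mean(log_q - log_p(x))

# Started near the right-hand mode, the optimum collapses onto that single mode.
result = minimize(reverse_kl, x0=[1.0, 0.0], method="Nelder-Mead")
mu_opt, sigma_opt = result.x[0], np.exp(result.x[1])
print(f"mu ≈ {mu_opt:.2f}, sigma ≈ {sigma_opt:.2f}")  # roughly mu ≈ 4, sigma ≈ 1
```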
How to Choose Between Forward and Reverse KL
The choice depends on whether you can sample from the target distribution and/or evaluate (unnormalized) probabilities of samples under the target:
Use Forward KL when you can sample from the target but cannot compute its probability density (e.g., supervised learning, many generative models). In this case FKL reduces to maximum likelihood.
Use Reverse KL when you can compute (unnormalized) probabilities of samples under the target but cannot easily sample from it (e.g., variational inference, energy‑based models).
If both sampling and probability evaluation are possible (e.g., knowledge distillation, where a teacher model provides both samples and probabilities), either KL can be used, and the better choice should be decided by empirical analysis of the specific task (see the sketch after this list).
If neither operation is feasible, alternative modeling strategies are required.
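For the distillation case mentioned above, where the teacher provides both samples and probabilities, both objectives are directly computable and can simply be compared on the task at hand. A minimal sketch with made‑up teacher and student distributions (the batch size, vocabulary size, and logits are all illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q):
    # KL(p || q), summed over the vocabulary dimension and averaged over the batch.
    return np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1))

# Illustrative teacher and student distributions for 4 tokens over a 10-word vocabulary.
rng = np.random.default_rng(0)
teacher_probs = softmax(rng.normal(size=(4, 10)))   # the "target" P
student_probs = softmax(rng.normal(size=(4, 10)))   # the "model"  Q

forward_kl = kl(teacher_probs, student_probs)   # KL(P || Q): mass-covering objective
reverse_kl = kl(student_probs, teacher_probs)   # KL(Q || P): mode-seeking objective
print(forward_kl, reverse_kl)   # generally different values, reflecting the asymmetry
```

Which direction trains the student better is task‑dependent, which is why the article recommends deciding empirically in this case.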
Conclusion
The article presented Forward KL and Reverse KL, explained their asymmetric optimization objectives, illustrated their distinct behaviors on a bimodal target with a limited‑capacity Gaussian model, and offered concrete criteria for selecting the appropriate divergence based on the availability of sampling and probability‑evaluation mechanisms in a given machine‑learning scenario.
