Demystifying Entropy: From Basic Concepts to Cross‑Entropy and KL Divergence

This article explains entropy, joint entropy, conditional entropy, and related measures such as KL divergence and cross‑entropy, using intuitive coin‑flip examples and mathematical formulas to show how they quantify uncertainty and information in probability distributions.

21CTO
Consider the following questions: What is entropy? What is cross-entropy? What is joint entropy? What is conditional entropy? What is relative entropy? How are they related, and how do they differ?

If you find answering these questions challenging, this article is for you.

1. Starting from random variables

Consider a coin toss, and let y denote the face that lands up. The value of y is uncertain: it can be either heads or tails. A person's height z is similarly uncertain, since it varies across individuals. Such uncertain quantities are called random variables, and probability distributions are the standard tool for describing them.

2. What is entropy?

Entropy characterizes the uncertainty of a probability distribution. For example, a coin that lands heads with probability 0.5 is more uncertain than one that lands heads with probability 0.8: the second coin's outcome is easier to predict, so its distribution has lower entropy. Entropy puts a number on this uncertainty.

3. Mathematical expression of entropy

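For a single outcome with probability P, the uncertainty (often called self-information or surprisal) is -log P: the less likely the outcome, the larger the value. Extending this to an entire distribution, entropy is the expected value of -log P, i.e., H = -∑ P log P, summed over all outcomes. With base-2 logarithms the unit is bits; with natural logarithms it is nats.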
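As a minimal sketch of the formula above, the following computes entropy for a discrete distribution given as a list of probabilities. The function name `entropy`, the `base` parameter, and the two example dice are illustrative assumptions, not part of the original article.

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy -sum(p * log p) of a discrete distribution."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Outcomes with p == 0 are skipped: p * log p tends to 0 as p -> 0.
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# Hypothetical example distributions over the six faces of a die.
fair_die = [1 / 6] * 6
loaded_die = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]

print(entropy(fair_die))    # log2(6), roughly 2.585 bits
print(entropy(loaded_die))  # smaller: the loaded die is easier to predict
```

The uniform die gives the largest value possible for six outcomes, which matches the intuition that it is the hardest to predict.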

4. Entropy of a Bernoulli distribution

For a coin with head probability p, the entropy is H(p) = -p log p - (1‑p) log (1‑p). The entropy reaches its maximum at p = 0.5 and approaches zero as p → 0 or p → 1.
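The shape of H(p) can be checked numerically. This is a small sketch assuming base-2 logarithms (so the result is in bits); the helper name `bernoulli_entropy` is hypothetical.

```python
import math

def bernoulli_entropy(p):
    """Entropy in bits of a coin that lands heads with probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no uncertainty
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Entropy peaks at 1 bit for a fair coin and shrinks as the coin gets more biased.
for p in (0.5, 0.8, 0.99):
    print(p, round(bernoulli_entropy(p), 4))
```

For p = 0.8 the result is roughly 0.72 bits, noticeably less than the 1 bit of a fair coin. The function is also symmetric: a coin with p = 0.2 has the same entropy as one with p = 0.8.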

5. Joint entropy

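Given two random variables X and Y with joint distribution p(x, y), the joint entropy is H(X, Y) = -∑ p(x, y) log p(x, y), summed over all pairs (x, y). Joint entropy is never smaller than either individual entropy, H(X, Y) ≥ max(H(X), H(Y)), and never larger than their sum, H(X, Y) ≤ H(X) + H(Y), with equality in the latter when X and Y are independent. It equals H(X) exactly when Y is fully determined by X, so that Y adds no uncertainty of its own.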
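To make the joint-entropy formula concrete, here is a sketch that stores a joint distribution as a dictionary keyed by (x, y) pairs. The two-coin table and the helper names `entropy` and `marginal` are made up for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def marginal(joint, axis):
    """Sum the joint table over the other variable (axis 0 -> X, axis 1 -> Y)."""
    out = {}
    for outcome, p in joint.items():
        out[outcome[axis]] = out.get(outcome[axis], 0.0) + p
    return out

# Hypothetical joint distribution p(x, y) of two correlated coins.
joint = {
    ("H", "H"): 0.4, ("H", "T"): 0.1,
    ("T", "H"): 0.1, ("T", "T"): 0.4,
}

h_xy = entropy(joint.values())             # joint entropy H(X, Y)
h_x = entropy(marginal(joint, 0).values())  # H(X)
h_y = entropy(marginal(joint, 1).values())  # H(Y)

print(h_x, h_y, h_xy)
```

Here each coin alone has 1 bit of entropy, while the pair has roughly 1.72 bits: more than either coin alone, but less than the 2 bits that two independent fair coins would have.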

6. Conditional entropy

Conditional entropy measures the uncertainty that remains in X once Y is known: H(X|Y) = H(X, Y) - H(Y). Knowing Y never increases uncertainty on average, so H(X|Y) ≤ H(X), with equality exactly when X and Y are independent. The reduction H(X) - H(X|Y) is the mutual information between X and Y, the quantity behind information gain.
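The identity H(X|Y) = H(X, Y) - H(Y) can be checked against the direct definition -∑ p(x, y) log p(x|y). The sketch below does both on a hypothetical two-coin joint table; all names and numbers are illustrative assumptions.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def marginal(joint, axis):
    """Sum the joint table over the other variable (axis 0 -> X, axis 1 -> Y)."""
    out = {}
    for outcome, p in joint.items():
        out[outcome[axis]] = out.get(outcome[axis], 0.0) + p
    return out

# Hypothetical joint distribution p(x, y) of two correlated coins.
joint = {
    ("H", "H"): 0.4, ("H", "T"): 0.1,
    ("T", "H"): 0.1, ("T", "T"): 0.4,
}
p_x = marginal(joint, 0)
p_y = marginal(joint, 1)

# Chain-rule form: H(X|Y) = H(X, Y) - H(Y).
h_x_given_y = entropy(joint.values()) - entropy(p_y.values())

# Direct form: -sum over (x, y) of p(x, y) * log2 p(x|y), with p(x|y) = p(x, y) / p(y).
direct = -sum(p * math.log2(p / p_y[y]) for (x, y), p in joint.items() if p > 0)

# Mutual information: how much knowing Y reduces the uncertainty in X.
mutual_information = entropy(p_x.values()) - h_x_given_y

print(h_x_given_y, direct, mutual_information)
```

Both forms agree (roughly 0.72 bits here), and the mutual information of roughly 0.28 bits is the part of X's 1 bit of uncertainty that observing Y removes.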

7. Relative entropy (KL divergence)

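KL divergence D(p‖q) quantifies how much an estimated distribution q differs from a true distribution p. It is defined as D(p‖q) = ∑ p(x) log (p(x) / q(x)) = -∑ p(x) log q(x) - H(p). It is closely tied to the maximum-likelihood principle: choosing q to minimize D(p‖q) is equivalent to maximizing the expected log-likelihood of q under p. The divergence is never negative, and it is zero exactly when q = p.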
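A short numerical sketch of KL divergence follows. It assumes the convention that the first argument is the true (reference) distribution and the second is the estimate, and computes ∑ first(x) log2 (first(x) / second(x)). The function name and the example distributions are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL divergence in bits, with p as the reference and q as the estimate."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of outcomes")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue          # outcomes that never occur under p contribute nothing
        if qi == 0:
            return math.inf   # q rules out an outcome that p allows
        total += pi * math.log2(pi / qi)
    return total

fair = [0.5, 0.5]    # hypothetical true distribution
biased = [0.8, 0.2]  # hypothetical estimate

print(kl_divergence(fair, fair))    # 0.0: identical distributions
print(kl_divergence(fair, biased))  # positive
print(kl_divergence(biased, fair))  # also positive, but a different value
```

Swapping the arguments changes the result (roughly 0.32 versus 0.28 bits here), which is why KL divergence is called a divergence rather than a distance.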

8. Cross‑entropy

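Cross-entropy is the term -∑ p(x) log q(x), where p is the true distribution and q the estimate. It equals the KL divergence between the two plus the entropy of p, so when p is fixed, minimizing cross-entropy over q is the same as minimizing KL divergence. It is not symmetric: the cross-entropy of p relative to q generally differs from that of q relative to p.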
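The relationship between cross-entropy, entropy, and KL divergence can be verified numerically. The sketch below assumes p is the true distribution and q the estimate, uses base-2 logarithms, and checks that cross-entropy equals the entropy of p plus the KL divergence from p to q; the function names and example distributions are illustrative.

```python
import math

def entropy(p):
    """Entropy of p in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """-sum p(x) * log2 q(x): cost of describing samples from p using q."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf  # q assigns zero probability to a possible outcome
        total -= pi * math.log2(qi)
    return total

def kl_divergence(p, q):
    """sum p(x) * log2(p(x) / q(x)), assuming q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]  # hypothetical true distribution
q = [0.8, 0.2]  # hypothetical estimate

# Decomposition: cross-entropy = entropy of p + KL divergence from p to q.
print(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))

# Asymmetry: swapping the roles of p and q gives a different value.
print(cross_entropy(q, p))
```

When q equals p the KL term vanishes and the cross-entropy reduces to the entropy of p, which is the smallest value it can take for a fixed p.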

9. No summary, no progress

The author shares personal insights on these entropy concepts, hoping the explanation helps readers and invites feedback.

Source: https://www.jianshu.com/p/09b70253c840

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
