
Understanding Entropy, Joint Entropy, Conditional Entropy, Relative Entropy, and Cross Entropy

This article explains the concepts of entropy, joint entropy, conditional entropy, relative entropy (KL divergence) and cross‑entropy, illustrating their definitions, mathematical formulas, intuitive interpretations, and relationships through simple probability examples and visual diagrams.

Architecture Digest

The article begins by posing the questions: what are entropy, cross‑entropy, joint entropy, conditional entropy, and relative entropy, and how are they related?

It introduces random variables using a coin‑toss example, explaining that the outcome y is uncertain and that such uncertain variables are called random variables, whose behavior is described by probability distributions.

Entropy is presented as a quantitative measure of the uncertainty of a probability distribution; the intuition is illustrated with two coins having head probabilities 0.5 and 0.8, showing that the more biased coin has lower uncertainty.

The mathematical expression of entropy is given as H(X) = -∑ p(x) log p(x), accompanied by a plot of -log p versus p.
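The formula and the two-coin intuition above can be sketched directly. This is a minimal illustration (not code from the article), using log base 2 so entropy comes out in bits:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum p * log2(p), with 0 log 0 := 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# The two coins from the example: fair (heads with 0.5) vs. biased (0.8).
h_fair = entropy([0.5, 0.5])    # 1.0 bit: maximal uncertainty
h_biased = entropy([0.8, 0.2])  # about 0.722 bits: less uncertain
```

As the article's intuition suggests, the more biased coin has the lower entropy.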

For a Bernoulli distribution the entropy is H(p) = -p log p - (1-p) log (1-p). The article shows that entropy is maximal at p = 0.5 and approaches zero as p approaches 0 or 1, and compares this behavior with the variance of the distribution.
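The shape of the Bernoulli entropy curve, and its comparison with the variance p(1-p), can be checked numerically with a short sketch (not from the article itself):

```python
import math

def bernoulli_entropy(p):
    """H(p) = -p log2 p - (1-p) log2 (1-p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bernoulli_variance(p):
    return p * (1 - p)

ps = [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]
entropies = [bernoulli_entropy(p) for p in ps]
variances = [bernoulli_variance(p) for p in ps]
# Both curves peak at p = 0.5 and vanish at p = 0 and p = 1.
```

Both quantities measure spread, but entropy peaks at exactly 1 bit while variance peaks at 0.25.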

Joint entropy is defined as H(X,Y) = -∑ p(x,y) log p(x,y) . An intuitive argument explains why joint entropy is always greater than or equal to the individual entropies, with a diagram of the joint distribution of two independent Bernoulli variables.
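The article's diagram of two independent Bernoulli variables can be reproduced in a few lines; for independent X and Y the joint distribution factors as p(x, y) = p(x) p(y), which makes the inequality easy to verify (a sketch, not the article's own code):

```python
import math

def entropy(probs):
    """Shannon entropy in bits over a flat list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Two independent Bernoulli variables: p(x, y) = p(x) * p(y).
px = [0.5, 0.5]   # fair coin X
py = [0.8, 0.2]   # biased coin Y
joint = [p * q for p in px for q in py]

h_x = entropy(px)
h_y = entropy(py)
h_xy = entropy(joint)  # H(X, Y) = -sum p(x,y) log p(x,y)
# Under independence H(X, Y) = H(X) + H(Y), so the joint entropy
# is at least as large as either individual entropy.
```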

Conditional entropy is introduced as H(X|Y) = H(X,Y) - H(Y) . The article discusses the inequality H(X|Y) ≤ H(X) , its equality condition when X and Y are independent, and connects the reduction in uncertainty to information gain (mutual information).
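The identity H(X|Y) = H(X,Y) - H(Y) and the link to mutual information can be demonstrated on a small dependent pair. The 2x2 joint table below is a hypothetical example chosen for illustration, not one taken from the article:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical dependent pair: X tends to agree with Y.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
px = [sum(row) for row in joint]        # marginal of X
py = [sum(col) for col in zip(*joint)]  # marginal of Y
flat = [p for row in joint for p in row]

h_xy = entropy(flat)
h_x, h_y = entropy(px), entropy(py)
h_x_given_y = h_xy - h_y          # H(X|Y) = H(X,Y) - H(Y)
mutual_info = h_x - h_x_given_y   # information gain I(X;Y)
# Because X and Y are dependent, H(X|Y) < H(X):
# knowing Y strictly reduces the uncertainty about X.
```

The reduction h_x - h_x_given_y is exactly the mutual information I(X;Y), which also equals H(X) + H(Y) - H(X,Y).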

Relative entropy (KL divergence) is described as D(p‖q) = -∑ p(x) log q(x) - H(p), i.e., the cross-entropy minus the entropy of p. It is interpreted as a measure of how close the estimated distribution q is to the empirical (true) distribution p, becoming zero exactly when the two distributions coincide.
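The decomposition of KL divergence into cross-entropy minus entropy can be written out directly (an illustrative sketch with example distributions, not code from the article):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum p(x) log2 q(x)."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D(p||q) = H(p, q) - H(p) = sum p(x) log2(p(x) / q(x))."""
    return cross_entropy(p, q) - entropy(p)

p = [0.5, 0.5]   # empirical (true) distribution
q = [0.8, 0.2]   # estimated distribution
d_pq = kl_divergence(p, q)   # strictly positive since q differs from p
d_pp = kl_divergence(p, p)   # exactly zero when the distributions coincide
```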

Cross-entropy is defined as H(p, q) = -∑ p(x) log q(x), the first term in the relative-entropy expression above, noting that it is not symmetric: H(p, q) differs from H(q, p) in general.
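The asymmetry is easy to exhibit with the same two example distributions (a sketch, not the article's code):

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum p(x) log2 q(x)."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.8, 0.2]
h_pq = cross_entropy(p, q)  # expected surprisal -log q, weighted by p
h_qp = cross_entropy(q, p)  # expected surprisal -log p, weighted by q
# The two values differ: cross-entropy is not symmetric in its arguments.
```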

The author concludes with a personal note encouraging readers to critique and improve the explanations.

Machine Learning, statistics, probability, entropy, information theory, KL divergence, cross-entropy
Written by Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
