Information Theory Foundations for Machine Learning and Deep Learning
This article explains Shannon information content, entropy, cross-entropy, KL divergence, conditional entropy, and mutual information, illustrating each concept with coin-flip and dice examples and discussing their roles as loss functions and evaluation metrics in machine-learning models.
Shannon Information Content
Shannon information content measures the amount of information gained when we observe the outcome of a random variable X: the less probable an outcome, the more information it carries. If a coin always shows heads, the outcome is perfectly predictable and its information content is zero. For a fair coin, the two outcomes are equally likely and the result is maximally unpredictable; more generally, a uniform distribution makes the outcomes as unpredictable, and therefore as informative on average, as possible.
In computer-science terms, information content can be viewed as the minimum number of bits needed to encode an event. For a fair coin, a code based on event frequencies assigns 0b0 to heads and 0b1 to tails. With base-2 logarithms, the information content of one fair-coin flip is -log₂(½) = 1 bit.
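As a small illustration (a sketch of our own, not from the original article; the function name information_content is ours), the self-information of an outcome can be computed directly from its probability:

```python
import math

def information_content(p: float) -> float:
    """Shannon self-information of an event with probability p, in bits."""
    return -math.log2(p)

print(information_content(0.5))    # fair-coin flip: 1.0 bit
print(information_content(1.0))    # certain event: 0.0 bits
print(information_content(1 / 6))  # one face of a fair die: ~2.585 bits
```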
If X is a random variable whose values are produced by a random process, such as the face of a die or the total number of heads in 20 coin flips, those values are described by a distribution p(x). For example, by the central limit theorem the sum of ten dice rolls is approximately Gaussian.
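A quick NumPy simulation (our own sketch) checks this Gaussian approximation: the sum of ten dice rolls has mean 10 · 3.5 = 35 and standard deviation √(10 · 35/12) ≈ 5.4, and the empirical values match closely.

```python
import numpy as np

rng = np.random.default_rng(0)
sums = rng.integers(1, 7, size=(100_000, 10)).sum(axis=1)  # 100k trials of ten dice rolls

print("empirical mean:", sums.mean())  # close to 10 * 3.5 = 35
print("empirical std: ", sums.std())   # close to sqrt(10 * 35/12) ≈ 5.40
```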
Information Entropy
Entropy H(X) measures the expected information content of a random variable. It is computed by weighting the information content of each outcome by its probability and summing: H(X) = -Σ p(x) log₂ p(x).
For a fair coin, p(head) = ½ and p(tail) = ½, so its entropy is ½·1 + ½·1 = 1 bit, meaning that on average one bit is needed to encode each flip.
Entropy therefore establishes a lower bound on the average number of bits required to encode events drawn from a given probability distribution.
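A minimal sketch of the entropy calculation (our own code) for a fair and a biased coin:

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution given as probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))  # biased coin: ~0.469 bits (more predictable, lower entropy)
print(entropy([1.0, 0.0]))  # always heads: 0.0 bits
```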
Cross Entropy
Cross entropy H(P, Q) = -Σ p(x) log₂ q(x) measures the expected number of bits needed to encode X when the code is optimized for distribution Q but the samples actually follow distribution P.
In machine learning we want the model distribution Q to match the ground-truth distribution P. When they match, cross entropy reaches its minimum, the entropy of P, which is why it is widely used as a training objective.
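As an illustrative sketch (our own, not from the article), the cross-entropy between a one-hot ground truth and predicted class probabilities reduces to the negative log-probability of the true class, the familiar classification loss:

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """H(P, Q) in bits for discrete distributions p (true) and q (model)."""
    return -sum(pi * math.log2(qi + eps) for pi, qi in zip(p, q))

p = [0.0, 1.0, 0.0]          # one-hot ground truth: class 1
q = [0.1, 0.7, 0.2]          # model's predicted probabilities

print(cross_entropy(p, q))   # -log2(0.7) ≈ 0.515 bits
print(cross_entropy(p, p))   # perfect prediction: ≈ 0 bits, the minimum (the entropy of p)
```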
KL Divergence
KL divergence quantifies the difference between two distributions P and Q. It can be written as the difference between cross entropy and entropy, KL(P, Q) = Σ p(x) log₂(p(x)/q(x)) = H(P, Q) − H(P), so it measures the extra bits required when encoding samples from P with a sub-optimal code built for Q. KL divergence is always non-negative.
Minimizing KL divergence with respect to the model parameters is therefore equivalent to minimizing cross entropy, because the entropy term H(P) does not depend on the model.
KL(p, q) ≥ 0
KL(p, p) = 0
KL(p, q) ≠ KL(q, p) (asymmetry)
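A small Python check (our own) of these properties and of the identity KL(P, Q) = H(P, Q) − H(P):

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

print(kl(p, q), cross_entropy(p, q) - entropy(p))  # equal (≈ 0.265): KL = cross entropy - entropy
print(kl(p, p))                                    # 0.0
print(kl(p, q), kl(q, p))                          # different values: asymmetry
```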
The asymmetry has important implications. If the ground-truth p is a bimodal distribution and we model it with a unimodal Gaussian q, minimizing the forward KL(p, q) yields a Gaussian that spreads out to cover both modes, whereas minimizing the reverse KL(q, p) yields a Gaussian that locks onto a single mode, and the optimization can get stuck in the corresponding local optimum.
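A small numerical sketch of this behavior (our own, not from the article): we grid-search the mean and standard deviation of a single Gaussian q to minimize either the forward or the reverse KL against a bimodal target p on a discretized axis.

```python
import numpy as np

x = np.linspace(-6, 6, 601)

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def normalize(pdf):
    return pdf / pdf.sum()

# Bimodal ground truth: two narrow modes at -2 and +2.
p = normalize(0.5 * normal_pdf(x, -2.0, 0.5) + 0.5 * normal_pdf(x, 2.0, 0.5))

def kl(a, b, eps=1e-12):
    return float(np.sum(a * (np.log(a + eps) - np.log(b + eps))))

best_fwd = best_rev = None
for mu in np.linspace(-3, 3, 61):
    for sigma in np.linspace(0.3, 3.0, 28):
        q = normalize(normal_pdf(x, mu, sigma))
        fwd, rev = kl(p, q), kl(q, p)
        if best_fwd is None or fwd < best_fwd[0]:
            best_fwd = (fwd, mu, sigma)
        if best_rev is None or rev < best_rev[0]:
            best_rev = (rev, mu, sigma)

print("argmin KL(p, q): mu=%.2f sigma=%.2f" % best_fwd[1:])  # mean near 0, wide: covers both modes
print("argmin KL(q, p): mu=%.2f sigma=%.2f" % best_rev[1:])  # mean near one mode, narrow: mode-seeking
```

The forward fit lands roughly midway between the modes with a large spread, while the reverse fit collapses onto one of the two modes.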
Conditional Entropy
Conditional entropy H(Y|X) is the entropy of Y that remains once X is known. It equals the entropy of Y within each group defined by a value of X, weighted by that group's probability: H(Y|X) = Σ p(x) H(Y|X=x). If Y can be perfectly separated based on X, every group is pure and the conditional entropy is zero.
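A short sketch (our own) that computes H(Y|X) from labelled examples, grouping by the value of X exactly as a decision-tree split would; the toy weather data are made up for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def conditional_entropy(xs, ys):
    """H(Y|X): entropy of Y within each X-group, weighted by group size."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    n = len(ys)
    return sum(len(g) / n * entropy(g) for g in groups.values())

xs = ["sunny", "sunny", "rain", "rain", "rain", "sunny"]   # feature X
ys = ["yes",   "yes",   "no",   "no",   "yes",  "yes"]     # label Y

print(entropy(ys))                  # H(Y) ≈ 0.918 bits
print(conditional_entropy(xs, ys))  # H(Y|X) ≈ 0.459 bits: knowing X reduces uncertainty
```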
Mutual Information (Information Gain)
Mutual information I(X; Y) quantifies how much information about X is obtained by observing Y: I(X; Y) = H(X) − H(X|Y). Intuitively, if Y provides complete information about X, the conditional entropy H(X|Y) drops to zero and the mutual information equals H(X); if Y provides no information about X, the conditional entropy stays at H(X) and the mutual information is zero.
For example, knowing an object's label (Y) tells us a lot about its raw image (X), so the two have high mutual information. In decision trees, splits are chosen to maximize mutual information (information gain), and in models such as InfoGAN, mutual information between the generated image and the latent code that acts as its intended label is maximized.
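Continuing the sketch above (our own code), the information gain of that toy split is simply the drop in entropy, and because mutual information is symmetric the same number measures the information X and Y share:

```python
# Reusing entropy() and conditional_entropy() from the previous sketch.
information_gain = entropy(ys) - conditional_entropy(xs, ys)
print(information_gain)  # ≈ 0.459 bits: I(X; Y) = H(Y) - H(Y|X) = H(X) - H(X|Y)
```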