
What Is Information Entropy? Definition, Calculation, and Decision Tree Applications

This article explains the concept of information entropy and its mathematical definition, shows how to compute it for simple events such as coin flips and dice rolls, and demonstrates its role in decision-tree learning by calculating information gain for a small watermelon dataset.

When I teach complex concepts, students sometimes say, "Teacher, this class has too much information." That remark usually signals they have not fully understood, and it raises the question of what "amount of information" actually means and how to measure it. We hear the word "information" everywhere in everyday life, but how can we quantify its amount, its "weight"? This is where the concept of information entropy comes in.

What Is Information Entropy?

Information entropy was first introduced by Claude Shannon in his 1948 paper "A Mathematical Theory of Communication" as a way to quantify the uncertainty or randomness of information. The term "entropy" originally comes from thermodynamics, describing the disorder of a system. In information theory, it measures the uncertainty of information. Simply put, higher entropy means greater uncertainty, while lower entropy means greater certainty.

How to Calculate Information Entropy?

Assume we have a source that can emit various messages, each with a probability \(p_i\). The entropy \(H\) of the source is defined as:

\[ H = -\sum_{i} p_i \log_2 p_i \] where the summation runs over all possible messages.
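As a quick sketch, the formula translates directly into Python (the helper name `entropy` is my own choice, not part of the article):

```python
import math

def entropy(probs):
    """Shannon entropy in bits for a probability distribution.

    Terms with p == 0 are skipped, using the convention 0 * log2(0) = 0.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin: two outcomes with probability 0.5 each
print(entropy([0.5, 0.5]))  # 1.0
```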

Why is this formula reasonable? Consider the following points:

The formula considers the probability distribution over all possible events, weighting each event by how much information its occurrence conveys. If an event is certain (probability 1), its information content is zero: we learn nothing new when something we already knew would happen does happen. (Events with probability 0 contribute nothing either, by the convention \(0 \cdot \log 0 = 0\).)

Entropy is a symmetric function of the probability distribution, meaning that relabeling or swapping events does not change its value, aligning with the notion that the amount of information does not depend on the names of the events.

If two independent random sources are combined, the total entropy equals the sum of their individual entropies. This matches our intuition of adding information from separate sources.

Entropy as a function of probabilities is continuous, so small changes in probabilities lead to small changes in entropy.

For a fixed number of possible outcomes, the uniform distribution yields the maximum entropy. When all events are equally likely, uncertainty is greatest.
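These properties can be spot-checked numerically. A small sketch (the `entropy` helper here is my own, not part of the article):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Symmetry: reordering the probabilities leaves entropy unchanged
assert abs(entropy([0.7, 0.2, 0.1]) - entropy([0.1, 0.7, 0.2])) < 1e-12

# Additivity: for two independent sources, joint probabilities multiply,
# and the joint entropy equals the sum of the individual entropies
px, py = [0.5, 0.5], [0.25, 0.75]
joint = [a * b for a in px for b in py]
assert abs(entropy(joint) - (entropy(px) + entropy(py))) < 1e-9

# Maximum at the uniform distribution: any skewed four-outcome
# distribution has lower entropy than [0.25] * 4
assert entropy([0.4, 0.3, 0.2, 0.1]) < entropy([0.25] * 4)
```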

For example, tossing a fair coin (heads and tails each with probability 0.5) yields an entropy of 1 bit.

The unit of entropy is the bit, short for "binary digit". If an event has two equally likely outcomes, we obtain 1 bit of information regardless of which outcome occurs.

Consider a fair six‑sided die where each face has probability \(1/6\). The entropy is:

\[ H = -6 \times \frac{1}{6} \log_2 \frac{1}{6} \approx 2.585 \text{ bits} \] Thus, observing a die roll provides about 2.585 bits of information, more than a coin flip because the outcome is less certain.
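The die calculation can be verified numerically with a few lines of Python (a minimal sketch; the variable names are my own):

```python
import math

# Entropy in bits of a fair six-sided die: six outcomes, each with p = 1/6
p = 1 / 6
H_die = -sum(p * math.log2(p) for _ in range(6))
print(round(H_die, 3))  # 2.585
```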

The Significance of Information Entropy

Why compute entropy? It helps us understand the value of information. If an event is almost certain, such as the sun rising in the east, its entropy is low because it carries little new information. Rare or uncertain events have high information value.

Entropy also plays a crucial role in many practical applications, including data compression, cryptography, and machine learning. In decision‑tree algorithms, entropy is used to determine the best splitting attribute. Specifically, we aim to find the attribute that maximally reduces uncertainty, known as "information gain".

Decision Tree and Information Gain

Take a classic example of classifying watermelons. The dataset contains attributes "color" (light or dark) and "sound" (loud or quiet) and a label "sweet" (yes or no).

First, we compute the overall entropy based on the "sweet" label. Then, for each attribute, we calculate the weighted entropy of the subsets created by splitting on that attribute. The information gain is the difference between the original entropy and the weighted entropy after the split.

By calculating the information gain for "color" and for "sound", we find that color provides the larger gain (1 bit versus roughly 0.08 bits on this toy dataset), indicating it is more useful for distinguishing sweet from non-sweet melons.

<code>import math

# Given data
melons = [
    {"color": "light", "sound": "loud", "sweet": False},
    {"color": "light", "sound": "quiet", "sweet": False},
    {"color": "dark", "sound": "loud", "sweet": True},
    {"color": "dark", "sound": "quiet", "sweet": True},
    {"color": "light", "sound": "quiet", "sweet": False},
    {"color": "dark", "sound": "loud", "sweet": True}
]

# Calculate the entropy for a given set of probabilities
def entropy(probs):
    return sum([-p * math.log2(p) for p in probs if p > 0])

# Calculate the entropy of the dataset for the "sweet" attribute
def dataset_entropy(dataset):
    total = len(dataset)
    sweet_count = sum(1 for melon in dataset if melon["sweet"])
    probs = [sweet_count / total, 1 - sweet_count / total]
    return entropy(probs)

# Calculate the weighted entropy of the dataset for a specific attribute (color or sound)
def weighted_entropy(dataset, attribute):
    attribute_values = {melon[attribute] for melon in dataset}
    total_entropy = 0
    for value in attribute_values:
        subset = [melon for melon in dataset if melon[attribute] == value]
        weight = len(subset) / len(dataset)
        total_entropy += weight * dataset_entropy(subset)
    return total_entropy

# Calculate the information gain for a specific attribute (color or sound)
def information_gain(dataset, attribute):
    return dataset_entropy(dataset) - weighted_entropy(dataset, attribute)

# Calculate the information gains for color and sound
info_gain_color = information_gain(melons, "color")
info_gain_sound = information_gain(melons, "sound")

print(info_gain_color, info_gain_sound)  # color: 1.0, sound: about 0.082
</code>

Conclusion

Information entropy is a powerful mathematical model that helps us quantify uncertainty. By calculating entropy, we can better assess the value of information and achieve superior results in various technical applications. In the information age, understanding entropy is like having a ruler to measure information, so start using it to measure your data today!

Tags: machine learning, decision tree, information theory, information gain, information entropy
Written by

Model Perspective

Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".
