# Entropy: how Decision Trees make decisions

**This article is a translation, prepared ahead of the start of the Machine Learning course.** You are a Data Science specialist currently following a learning path, and you have come a long way since you wrote your first line of code in Python or R. You know Scikit-Learn like the back of your hand. You now spend more time on Kaggle than on Facebook. You are no stranger to building stunning random forests and other ensembles of decision trees that do an excellent job. Nevertheless, you know that you will not get anywhere unless you develop comprehensively. You want to dig deeper and understand the intricacies and concepts underlying the popular machine learning models. Well, me too.

Today I will talk about entropy, one of the most important concepts in statistics. Later we will cover Information Gain and see why these fundamental concepts form the basis of how decision trees are built from data.

Good. Now let’s get started.

What is entropy? In simple terms, entropy is nothing but a measure of disorder. (It can also be considered a measure of purity, and soon you will see why. But I like disorder more, because it sounds cooler.)

The mathematical formula for entropy is as follows:

$$E = -\sum_{i=1}^{N} p_i \log_2 p_i$$

*Entropy. Sometimes written as H.*

Here p_{i} is the frequency probability of element / class i in our data. For simplicity, suppose we have only two classes: positive and negative. Then i takes the value either “+” or “−”. If we had a total of 100 points in our dataset, 30 of which belonged to the positive class and 70 to the negative, then p_{+} would be 3/10 and p_{−} would be 7/10. Everything is simple so far.

If I calculate the entropy of the classes in this example using the formula above, here is what I get:

$$E = -\left(\frac{3}{10}\log_2\frac{3}{10} + \frac{7}{10}\log_2\frac{7}{10}\right) \approx 0.88$$

The entropy here is approximately 0.88. This is considered quite high: we have a high level of entropy or disorder (that is, low purity). Entropy is measured on a scale from 0 to 1. Depending on the number of classes in your dataset, the entropy can turn out to be greater than 1, but it means the same thing: the level of disorder is extremely high. For simplicity of explanation, the entropy in this article will stay in the range from 0 to 1.
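The formula above can be sketched in a few lines of Python (a minimal illustration of my own, not code from the article):

```python
import math

def entropy(probabilities):
    """Shannon entropy (base 2) of a list of class probabilities."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# 30 positive and 70 negative points out of 100:
e = entropy([0.3, 0.7])
print(round(e, 2))  # 0.88
```

A perfectly even split, `entropy([0.5, 0.5])`, gives exactly 1.0, the maximum for two classes.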

Take a look at the chart below.

The X axis shows the number of points from the positive class in each circle, and the Y axis shows the corresponding entropy. You can immediately notice the inverted U shape of the graph. Entropy is lowest at the extremes: when the circle contains no positive elements at all, or only positive elements. That is, when all elements in a circle are identical, the disorder is 0. Entropy is highest in the middle of the graph, where positive and negative elements are evenly distributed inside the circle. There entropy, or disorder, is greatest, since no class predominates.
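The inverted-U shape is easy to reproduce; here is a small sketch (the 10-point circle is my own illustration, not the article’s figure):

```python
import math

def entropy(probabilities):
    """Base-2 entropy; terms with p == 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Entropy of a "circle" of 10 points with k positive and 10 - k negative:
curve = [entropy([k / 10, 1 - k / 10]) for k in range(11)]

print(max(curve))                  # 1.0, reached at the even 5/5 mix
print(curve[0] == curve[10] == 0)  # True: pure circles have zero entropy
```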

Is there a deep reason why entropy uses the base-2 logarithm, or why it falls between 0 and 1 here rather than some other range? Not really: the base-2 logarithm is simply the convention that measures entropy in bits, and with two classes the result happens to land between 0 and 1. It is just a metric, and it is not so important to dwell on why it is defined this way. What matters is knowing how the value above is calculated and how it behaves. Entropy is a measure of confusion or uncertainty, and the goal of machine learning models, and of Data Science specialists in general, is to reduce this uncertainty.

Now we know how to measure disorder. Next we need a value that measures how much this disorder about the target variable / class is reduced by additional information (attributes / independent variables). This is where Information Gain comes into play. Mathematically, it can be written as follows:

$$IG(Y, X) = E(Y) - E(Y \mid X)$$

We simply subtract the entropy of Y given X from the entropy of Y in order to calculate the reduction in uncertainty about Y, provided we have information about X. The more the uncertainty decreases, the more information about Y can be obtained from X.

Let’s look at a simple example with a contingency table to get closer to how decision trees use entropy and information gain to decide which attribute to split nodes on during training.

**Example: Contingency Table**

Here our target variable is *Liability*, which can take only two values: *“Normal”* and *“High”*. We also have just one feature, called *Credit Rating*, whose values fall into three categories: *“Excellent”*, *“Good”* and *“Poor”*. A total of 14 observations were made: 7 belong to the class *Normal Liability* and 7 to the class *High Liability*. This is an even split in itself.

If we look at the totals in the first row, we see that we have 4 observations with the value *Excellent* for *Credit Rating*, so I can break my target variable down within that group. Among the observations with *“Excellent”* *Credit Rating*, 3 belong to the class *Normal Liability* and 1 to *High Liability*. I can similarly compute the breakdown for the other values of *Credit Rating* from the contingency table.

Now I will use the contingency table above to calculate the entropy of our target variable on its own, and then calculate its entropy given the additional information in the *Credit Rating* attribute. That way I can measure how much additional information *Credit Rating* gives me about the target variable *Liability*.

So let’s get started.

The entropy of our target variable is 1: maximum disorder, due to the even split of elements between *“Normal”* and *“High”*. The next step is to calculate the entropy of the target variable *Liability* given the additional information from *Credit Rating*. To do this, we calculate the entropy of *Liability* for each value of *Credit Rating* and combine them with a weighted average, weighting by the share of observations that take each value. Why we use a weighted average will become clearer when we talk about decision trees.
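The weighted-average step can be sketched as follows. Note that only the *Excellent* column (3 *Normal*, 1 *High*) is given above; the *Good* and *Poor* counts below are illustrative assumptions of mine, chosen only so the columns sum to the 7/7 class totals:

```python
import math

def entropy(counts):
    """Base-2 entropy from raw class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Columns: (Normal, High) counts per Credit Rating value.
# "Excellent" (3, 1) comes from the article; "Good" and "Poor"
# are hypothetical placeholders, not the article's actual table.
table = {"Excellent": (3, 1), "Good": (4, 2), "Poor": (0, 4)}

n = sum(sum(c) for c in table.values())   # 14 observations in total
parent = entropy([7, 7])                  # entropy of Liability alone = 1.0
conditional = sum(sum(c) / n * entropy(c) for c in table.values())
gain = parent - conditional
print(parent, round(conditional, 2), round(gain, 2))
```

Each group’s entropy is weighted by its share of the observations, which is exactly the weighted average described above.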

We have now obtained the entropy of our target variable given the attribute *Credit Rating*. We can calculate the information gain of *Liability* from *Credit Rating* to understand how informative this feature is.

Knowing *Credit Rating* helped us reduce the uncertainty about our target variable *Liability*! Isn’t that exactly what a good feature should do: give us information about the target variable? Well, for this very reason decision trees use entropy and information gain. They determine which attribute to split nodes on, so as to get closer to the target variable with each subsequent split, and also to decide when tree construction should stop! (in addition to hyperparameters such as maximum depth, of course). Let’s see how this all works in the following example with decision trees.

**Example: Decision Tree**

Let’s look at an example of building a decision tree to predict whether a person’s loan will be written off or not. The population consists of 30 instances: 16 belong to the class *“Write-off”* and the other 14 to *“Non-write-off”*. We have two features, namely *“Balance”*, which can take two values, “< 50K” or “> 50K”, and *“Residence”*, which takes three values: *“OWN”*, *“RENT”* or *“OTHER”*. I will demonstrate how the decision tree algorithm decides which attribute to split on first and which attribute is more informative, that is, which one best reduces the uncertainty of the target variable, using the concepts of entropy and information gain.

**Feature 1: Balance**

Here the circles belong to the class *“Write-off”* and the stars to the class *“Non-write-off”*. Splitting the parent root on the attribute *Balance* gives us 2 child nodes. The left node gets 13 observations, where 12/13 (probability 0.92) of the observations are from the class *“Write-off”* and only 1/13 (probability 0.08) from the class *“Non-write-off”*. The right node gets 17 of the 30 observations, where 4/17 (probability 0.24) of the observations are from the class *“Write-off”* and 13/17 (probability 0.76) from the class *“Non-write-off”*.

Let’s calculate the entropy of the root and see how much the tree can reduce uncertainty by splitting on *Balance*.

Splitting on the feature *Balance* gives an information gain of 0.37. Let’s compute the same for the feature *Residence* and compare the results.

**Feature 2: Residence**

Splitting the tree on *Residence* gives 3 child nodes. The left child gets 8 observations, where 7/8 (probability 0.88) of the observations are from the class *“Write-off”* and only 1/8 (probability 0.12) from the class *“Non-write-off”*. The middle child gets 10 observations, where 4/10 (probability 0.4) are from the class *“Write-off”* and 6/10 (probability 0.6) from the class *“Non-write-off”*. The right child gets 12 observations, where 5/12 (probability 0.42) are from the class *“Write-off”* and 7/12 (probability 0.58) from the class *“Non-write-off”*. We already know the entropy of the parent node, so we simply calculate the entropy after the split to find the information gain from the feature *Residence*.
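The two gains can be checked with a short script (a sketch of my own; the child counts are the ones consistent with the 16/14 class totals of the root):

```python
import math

def entropy(counts):
    """Base-2 entropy from raw class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Parent entropy minus the weighted entropy of the child nodes."""
    n = sum(parent)
    weighted = sum(sum(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

root = [16, 14]  # (Write-off, Non-write-off)

# Balance: left node (12, 1), right node (4, 13)
ig_balance = information_gain(root, [[12, 1], [4, 13]])

# Residence: left node (7, 1), middle node (4, 6), right node (5, 7)
ig_residence = information_gain(root, [[7, 1], [4, 6], [5, 7]])

print(round(ig_balance, 2), round(ig_residence, 2))
```

Up to rounding, this reproduces the gains discussed in the text: roughly 0.38 for *Balance* versus roughly 0.14 for *Residence*, a factor of almost 3.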

The information gain from the feature *Balance* is almost 3 times greater than from *Residence*! If you look at the graphs again, you will see that the split on *Balance* gives cleaner child nodes than the split on *Residence*. However, the leftmost node of *Residence* is also quite pure, and this is where the weighted average comes into play. Although that node is pure, it holds the fewest observations, so its contribution is diluted in the overall calculation of the total entropy from *Residence*. This matters because we are measuring the overall informativeness of a feature and do not want the final result to be distorted by a rare value of the feature.

The feature *Balance* by itself gives more information about the target variable than *Residence*: it reduces the entropy of our target variable more. The decision tree algorithm uses this result to make the first split on the attribute *Balance*, and then decides which attribute to split the subsequent nodes on. In the real world, when there are more than two features, the first split is made on the most informative feature, and at each subsequent split the information gain must be recomputed for each remaining feature, since it will no longer be the same as the gain from each feature taken in isolation. Entropy and information gain have to be recalculated after one or more splits have already occurred, which affects the result. The decision tree repeats this process as it grows in depth, until it either reaches a given depth or no split yields an information gain above a given threshold, which can also usually be specified as a hyperparameter!

That’s all! Now you know what entropy and information gain are and how they are calculated. You understand how a decision tree, by itself or as part of an ensemble, chooses the best order in which to split on attributes and decides when to stop while learning from the available data. And if you ever have to explain to someone how decision trees work, I hope you will cope with that task admirably.

I hope you have learned something useful from this article. If I missed something or phrased something inaccurately, please write to me about it. I will be very grateful. Thank you!