The first step in constructing a decision tree is to choose the most informative
attribute. A common way to identify the most informative attribute is to use
entropy-based methods, which are used by decision tree learning algorithms such
as ID3 (Iterative Dichotomiser 3) [7] and C4.5 [8]. The entropy-based methods select
the most informative attribute based on two basic measures:
Entropy, which measures the impurity of an attribute
Information gain, which measures the purity of an attribute
Given a class X and its label x \in X, let P(x) be the probability of x. H_X,
the entropy of X, is defined as shown in Equation 7.1.
H_X = -\sum_{x \in X} P(x) \log_2 P(x)    (7.1)
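As a minimal sketch of how Equation 7.1 could be evaluated in practice, the following
Python function computes the entropy of a set of class probabilities; the name entropy
and the probability-list input are illustrative assumptions, not part of ID3 or C4.5
themselves.
import math

def entropy(probabilities):
    # Equation 7.1: H_X = -sum over x of P(x) * log2 P(x).
    # Labels with probability 0 contribute nothing, since p * log2(p) -> 0 as p -> 0.
    return sum(-p * math.log2(p) for p in probabilities if p > 0)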
Equation 7.1 shows that the entropy H_X becomes 0 when every P(x) is 0 or 1. For a binary
classification (true or false), H_X is zero if the probability P(x) of each label x
is either zero or one. On the other hand, H_X reaches its maximum when
all the class labels are equally probable. For a binary classification, H_X = 1 if the
probability of all class labels is 50/50. The maximum entropy increases as the
number of possible outcomes increases.
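To illustrate the last point, here is a sketch using the hypothetical entropy() helper
above: a uniform distribution over k equally probable labels has entropy log2(k), so the
maximum achievable entropy grows with the number of outcomes.
for k in [2, 4, 8]:
    uniform = [1.0 / k] * k      # k equally probable class labels
    print(k, entropy(uniform))   # prints 1.0, 2.0, and 3.0, i.e., log2(k)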
As an example of a binary random variable, consider tossing a coin with known,
not necessarily fair, probabilities of coming up heads or tails. The corresponding
entropy graph is shown in Figure 7.5. Let x = 1 represent heads and x = 0
represent tails. The entropy of the unknown result of the next toss is maximized
when the coin is fair. That is, when heads and tails have equal probability
P(x = 1) = P(x = 0) = 0.5, the entropy is H_X = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1.
On the other hand, if the coin is not fair, the probabilities of heads and tails would
not be equal and there would be less uncertainty. As an extreme case, when the
probability of tossing a head is equal to 0 or 1, the entropy is minimized to 0.
Therefore, the entropy for a completely pure variable is 0, and it is 1 for a set with
equal occurrences of both classes (heads and tails, or yes and no).
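The shape of the entropy curve in Figure 7.5 can be reproduced by evaluating the binary
entropy at a few values of the probability of heads; the loop below is a sketch that
again relies on the hypothetical entropy() helper defined earlier.
for p in [0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0]:
    # prints 0.0 at p = 0 and p = 1, peaking at 1.0 for a fair coin (p = 0.5)
    print(p, entropy([p, 1 - p]))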