electrical engineer who worked on communications theory. Amongst many other achievements, he produced a set of metrics that are widely used in many different fields. One of these metrics is entropy, which is essentially a measure of the randomness of a system (Figure 2.9). Entropy is the expected number of bits required to encode the classes, $C_1$ or $C_2$, of a randomly drawn member of a signal, $S$, under the optimal, shortest-length code:
$$\mathrm{Entropy}(S) = -p_{C_1} \log_2 p_{C_1} - p_{C_2} \log_2 p_{C_2},$$
where $p_{C_1}$ is the proportion of $S$ having type $C_1$, and $p_{C_2}$ is the proportion of type $C_2$ (Figure 2.9). The information value of a variable, $A$, is calculated based upon its information gain: the expected reduction in entropy due to sorting on $A$:
$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v),$$

where $\mathrm{Values}(A)$ is the set of values that $A$ can take, and $S_v$ is the subset of $S$ for which $A$ has value $v$.
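To make these definitions concrete, the following Python sketch computes both quantities. The function names and the representation of training examples as dictionaries mapping attribute names to values are illustrative assumptions rather than anything given in the text; note also that entropy() is written to handle any number of classes, with the two-class formula above as a special case.

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy, in bits, of a sequence of class labels.
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(labels).values())

def information_gain(examples, attribute, label='class'):
    # Expected reduction in entropy from sorting `examples`
    # (dicts of attribute -> value) on `attribute`.
    n = len(examples)
    total = entropy([ex[label] for ex in examples])
    remainder = 0.0
    for v in {ex[attribute] for ex in examples}:
        subset = [ex[label] for ex in examples if ex[attribute] == v]
        remainder += (len(subset) / n) * entropy(subset)
    return total - remainder

For example, entropy(['C1', 'C1', 'C2', 'C2']) returns 1.0 bit, the maximum for two classes, while a 90/10 split yields roughly 0.47 bits.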
At each iteration of the algorithm, the information gain is calculated for each variable $A$ in turn, and the variable which provides the maximum information gain is selected as the best decision attribute for that node. For each value of $A$, a new descendant node is created, and the training examples are sorted to the corresponding nodes (a sketch of this loop follows Figure 2.9). If the training examples
FIGURE 2.9
The entropy of a population increases with the proportion of positives, up to 50%, and then decreases smoothly. [Plot of entropy, from 0 to 1, against the proportion of positives, from 0 to 1.]
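The selection loop just described might be sketched as follows, reusing the information_gain() function above. The recursive structure, the majority-vote leaves, and the stopping test are our assumptions about a standard ID3-style tree builder, since the text's own description is cut off above.

from collections import Counter

def build_tree(examples, attributes, label='class'):
    # Grow a decision tree: choose the attribute with maximum
    # information gain, create one descendant node per value, and
    # sort the training examples down to the new nodes.
    labels = [ex[label] for ex in examples]
    if len(set(labels)) == 1 or not attributes:
        # Leaf: return the majority class (assumed stopping rule).
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes,
               key=lambda a: information_gain(examples, a, label))
    rest = [a for a in attributes if a != best]
    children = {}
    for v in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == v]
        children[v] = build_tree(subset, rest, label)
    return (best, children)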