electrical engineer who worked on communications theory. Amongst many other achievements, he produced a set of metrics that are widely used in many different fields. One of these metrics is entropy, which is essentially a measure of the randomness of a system (Figure 2.9). Entropy is the expected number of bits required to encode the classes, $C_1$ or $C_2$, of a randomly drawn member of a signal, $S$, under the optimal, shortest-length code:

$$-p_{C_1} \log_2 p_{C_1} - p_{C_2} \log_2 p_{C_2},$$

where $p_{C_1}$ is the proportion of $S$ having type $C_1$, and $p_{C_2}$ is the proportion of type $C_2$ (Figure 2.9).
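As a minimal illustration (not part of the original text; the function and variable names are our own), the two-class entropy can be computed as follows:

```python
import math

def binary_entropy(p_c1: float) -> float:
    """Entropy, in bits, of a two-class signal S in which a proportion
    p_c1 of members have class C1 and (1 - p_c1) have class C2."""
    p_c2 = 1.0 - p_c1
    # By convention 0 * log2(0) = 0, so a pure population has zero entropy.
    return -sum(p * math.log2(p) for p in (p_c1, p_c2) if p > 0.0)

print(binary_entropy(0.5))  # 1.0 bit: maximal, the peak of the curve in Figure 2.9
print(binary_entropy(0.9))  # ~0.47 bits: a skewed population is more predictable
print(binary_entropy(1.0))  # 0.0 bits: a pure population carries no surprise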
The information value of a variable, $A$, is calculated based upon its information gain: the expected reduction in entropy due to sorting on $A$.
$$\mathrm{Entropy}(S) = -p_{C_1} \log_2 p_{C_1} - p_{C_2} \log_2 p_{C_2}$$

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$
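As a sketch of how Gain(S, A) can be evaluated on a concrete dataset (the names `entropy` and `information_gain`, and the tuple-based data layout, are illustrative assumptions, not from the text):

```python
import math
from collections import Counter

def entropy(labels) -> float:
    """Entropy, in bits, of a sequence of class labels (the signal S)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attr) -> float:
    """Gain(S, A): expected reduction in entropy of `labels` from
    partitioning the examples on attribute `attr` (an index into each
    example tuple)."""
    n = len(labels)
    # Build the subsets S_v, one per observed value v of the attribute A.
    subsets = {}
    for x, y in zip(examples, labels):
        subsets.setdefault(x[attr], []).append(y)
    # Gain(S, A) = Entropy(S) - sum over v of (|S_v| / |S|) * Entropy(S_v)
    remainder = sum(len(s_v) / n * entropy(s_v) for s_v in subsets.values())
    return entropy(labels) - remainder

# Example: an attribute that separates the two classes perfectly
# gains a full bit of information.
xs = [("a",), ("a",), ("b",), ("b",)]
ys = ["C1", "C1", "C2", "C2"]
print(information_gain(xs, ys, 0))  # 1.0
```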
At each iteration of the algorithm, the information gain is calculated for each variable $A$ in turn, and the variable which provides maximum information gain is selected as the best decision attribute for that node. For each value of $A$, a new descendant node is created, and the training examples are sorted to the nodes. If the training examples at a node are perfectly classified, the process stops there; otherwise it is repeated at each new descendant node.
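A compact sketch of this greedy loop, assuming the `entropy` and `information_gain` helpers from the previous example (the recursive structure follows the standard ID3 procedure; all names are illustrative):

```python
from collections import Counter

def build_tree(examples, labels, attributes):
    """Recursively grow a decision tree by always sorting on the
    attribute with maximum information gain (ID3-style)."""
    # A perfectly classified node becomes a leaf labelled with its class.
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes left to sort on: fall back to the majority class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Select the best decision attribute A for this node.
    best = max(attributes, key=lambda a: information_gain(examples, labels, a))
    remaining = [a for a in attributes if a != best]
    tree = {"attribute": best, "branches": {}}
    # Create a descendant node for each value of A and sort the
    # training examples down to it.
    for value in {x[best] for x in examples}:
        pairs = [(x, y) for x, y in zip(examples, labels) if x[best] == value]
        xs, ys = (list(t) for t in zip(*pairs))
        tree["branches"][value] = build_tree(xs, ys, remaining)
    return tree
```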
[Figure 2.9: entropy plotted against the proportion of positives, both axes running from 0 to 1.]

FIGURE 2.9 The entropy of a population increases with the proportion of positives, up to 50%, and then decreases smoothly.