electrical engineer who worked on communications theory. Amongst many other achievements, he produced a set of metrics that are widely used in many different fields. One of these metrics is entropy, which is essentially a measure of the randomness of a system (Figure 2.9). Entropy is the expected number of bits required to encode the classes, $C_1$ or $C_2$, of a randomly drawn member of a signal, $S$, under the optimal, shortest-length code:

$$-p_{C_1} \log_2 p_{C_1} - p_{C_2} \log_2 p_{C_2},$$

where $p_{C_1}$ is the proportion of $S$ having type $C_1$, and $p_{C_2}$ is the proportion of type $C_2$ (Figure 2.9).
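As a minimal illustration (not part of the original text; the function and variable names are our own), the two-class entropy can be computed as follows:

```python
import math

def binary_entropy(p_c1: float) -> float:
    """Entropy, in bits, of a two-class signal S in which a proportion
    p_c1 of members have class C1 and (1 - p_c1) have class C2."""
    p_c2 = 1.0 - p_c1
    # By convention 0 * log2(0) = 0, so a pure population has zero entropy.
    return -sum(p * math.log2(p) for p in (p_c1, p_c2) if p > 0.0)

print(binary_entropy(0.5))  # 1.0 bit: maximal, the peak of the curve in Figure 2.9
print(binary_entropy(0.9))  # ~0.47 bits: a skewed population is more predictable
print(binary_entropy(1.0))  # 0.0 bits: a pure population carries no surprise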
The information value of a variable, $A$, is calculated based upon its information gain: the expected reduction in entropy due to sorting on $A$.
$$\mathrm{Entropy}(S) = -p_{C_1} \log_2 p_{C_1} - p_{C_2} \log_2 p_{C_2}$$

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$
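As a sketch of how Gain(S, A) can be evaluated on a concrete dataset (the names `entropy` and `information_gain`, and the tuple-based data layout, are illustrative assumptions, not from the text):

```python
import math
from collections import Counter

def entropy(labels) -> float:
    """Entropy, in bits, of a sequence of class labels (the signal S)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attr) -> float:
    """Gain(S, A): expected reduction in entropy of `labels` from
    partitioning the examples on attribute `attr` (an index into each
    example tuple)."""
    n = len(labels)
    # Build the subsets S_v, one per observed value v of the attribute A.
    subsets = {}
    for x, y in zip(examples, labels):
        subsets.setdefault(x[attr], []).append(y)
    # Gain(S, A) = Entropy(S) - sum over v of (|S_v| / |S|) * Entropy(S_v)
    remainder = sum(len(s_v) / n * entropy(s_v) for s_v in subsets.values())
    return entropy(labels) - remainder

# Example: an attribute that separates the two classes perfectly
# gains a full bit of information.
xs = [("a",), ("a",), ("b",), ("b",)]
ys = ["C1", "C1", "C2", "C2"]
print(information_gain(xs, ys, 0))  # 1.0
```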
At each iteration of the algorithm, the information gain is calculated for each variable $A$ in turn, and the variable which provides maximum information gain is selected as the best decision attribute for that node. For each value of $A$, a new descendant node is created, and the training examples are sorted to the nodes. If the training examples at a node are perfectly classified, the process stops there; otherwise it is repeated at each new descendant node.
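A compact sketch of this greedy loop, assuming the `entropy` and `information_gain` helpers from the previous example (the recursive structure follows the standard ID3 procedure; all names are illustrative):

```python
from collections import Counter

def build_tree(examples, labels, attributes):
    """Recursively grow a decision tree by always sorting on the
    attribute with maximum information gain (ID3-style)."""
    # A perfectly classified node becomes a leaf labelled with its class.
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes left to sort on: fall back to the majority class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Select the best decision attribute A for this node.
    best = max(attributes, key=lambda a: information_gain(examples, labels, a))
    remaining = [a for a in attributes if a != best]
    tree = {"attribute": best, "branches": {}}
    # Create a descendant node for each value of A and sort the
    # training examples down to it.
    for value in {x[best] for x in examples}:
        pairs = [(x, y) for x, y in zip(examples, labels) if x[best] == value]
        xs, ys = (list(t) for t in zip(*pairs))
        tree["branches"][value] = build_tree(xs, ys, remaining)
    return tree
```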
[Figure 2.9: entropy plotted against the proportion of positives, both axes running from 0 to 1.]

FIGURE 2.9 The entropy of a population increases with the proportion of positives, up to 50%, and then decreases smoothly.