Classifier - Software Development Case Studies in Java

Java Reference

In-Depth Information

The algorithm to build the decision tree is slightly more complex (see

Algorithm 4.2). This task is called training. Each node is associated with a

set of items (which is a subset of the training set), and a label, which is either

the split feature (intermediate node) or the category name (leaf node). The

tree is built incrementally starting from the root node.

The algorithm is presented in its recursive form, which is in our view

more understandable.

Input: a training set of items T, a set of features F

Output: (the root node of) a decision tree

1 if all items in T belong to the same category return a leaf node labelled

according to such a category

2 if F is empty

(a) if T is empty return a leaf node labelled “unknown”

(b) else return a leaf node labelled “uncertain”

3 select the “best” split feature s â F

4 create a node n labelled s.name

5 for each possible value vi of the feature s

(a) let ni be the result of a recursive execution of this algorithm where the

first input is: Ti # {item â T | item.s ## vi} and the second input is: F 0 {s}

(b) set ni as child node of n and label the connecting arc vi and

6n is the resulting root node of the decision tree

Algorithm 4.2 Build the decision tree

Step 3 of Algorithm 4.2 leaves a detail unspecified. A criterion to identify

which is the best split feature has to be defined. In the literature (see for

instance Cherkassky and Mulier (1998)) several methods to select a split

feature have been proposed. We select the simplest, which is based on infor-

mation theory. We split on the feature that contains most information.

According to information theory it is possible to assign to a sequence of

symbols an information content, which is measured in bits.

Given an alphabet A containing N symbols, the information content of a

sequence of symbols taken from that alphabet can be defined as follows:

N

symbols

I

=

f i

log 2 (

f i )

1

i

=

Equation 4.1 Information content

where f i is the frequency of i th symbol. In our case the alphabet is the set of

categories. The sequence of symbols is generated taking each item of a set of

items and looking at its category. Therefore f i is the frequency of category i ;

it is calculated on set T as the number of items in T belonging to category i

divided by the number of items in T . In symbols:

|

{ item

â T |

category ( item )

=

category i }

|

f i =

| T |

Equation 4.2 Symbol frequency

Software Development Case Studies in Java

Search WWH ::

Custom Search

Home