Java Reference
InDepth Information
The algorithm to build the decision tree is slightly more complex (see
Algorithm 4.2). This task is called training. Each node is associated with a
set of items (which is a subset of the training set), and a label, which is either
the split feature (intermediate node) or the category name (leaf node). The
tree is built incrementally starting from the root node.
The algorithm is presented in its recursive form, which is in our view
more understandable.
Input: a training set of items T, a set of features F
Output: (the root node of) a decision tree
1 if all items in T belong to the same category return a leaf node labelled
according to such a category
2 if F is empty
(a) if T is empty return a leaf node labelled “unknown”
(b) else return a leaf node labelled “uncertain”
3 select the “best” split feature s â F
4 create a node n labelled s.name
5 for each possible value vi of the feature s
(a) let ni be the result of a recursive execution of this algorithm where the
first input is: Ti # {item â T  item.s ## vi} and the second input is: F 0 {s}
(b) set ni as child node of n and label the connecting arc vi and
6n is the resulting root node of the decision tree
Algorithm 4.2
Build the decision tree
Step 3 of Algorithm 4.2 leaves a detail unspecified. A criterion to identify
which is the best split feature has to be defined. In the literature (see for
instance Cherkassky and Mulier (1998)) several methods to select a split
feature have been proposed. We select the simplest, which is based on infor
mation theory. We split on the feature that contains most information.
According to information theory it is possible to assign to a sequence of
symbols an information content, which is measured in bits.
Given an alphabet
A
containing
N
symbols, the information content of a
sequence of symbols taken from that alphabet can be defined as follows:
N
symbols
I
=
f
i
log
2
(
f
i
)
1
i
=
Equation 4.1
Information content
where
f
i
is the frequency of
i
th symbol. In our case the alphabet is the set of
categories. The sequence of symbols is generated taking each item of a set of
items and looking at its category. Therefore
f
i
is the frequency of category
i
;
it is calculated on set
T
as the number of items in
T
belonging to category
i
divided by the number of items in
T
. In symbols:

{
item
â
T

category
(
item
)
=
category
i
}

f
i
=

T

Equation 4.2
Symbol frequency