Java Reference
InDepth Information
Node n
#
new
Node(splitFeature);
// 3) for each possible value vi of s
FeatureType splitType
#
(FeatureType)features.get(splitFeature);
Map partitions
#
performSplit(items,splitFeature,
splitType.allowedValues());
for
(Iterator iter
#
partitions.keySet().iterator();
iter.hasNext();) {
String value
#
(String) iter.next();
// a) be ni the result of a recursive execution of this
// algorithm where:
// the fist input is: Ti
#
{ item in T

item.s
##
vi }
// the second input is: A  { s }
Map partition
#
(Map)partitions.get(value);
Map remaining
#
new
HashMap(features);
remaining.remove(splitFeature);
Node child
#
buildDecisionTree(partition,remaining);
// b) set ni as child node of n and label the
// connecting arc vi
n.addChild(value,child);
}
// 4) n is the resulting root node of the decision tree
return
n;
}
In step 0 to determine whether all the items in a set belong to the same
category we check if the information content is null (see Equation 4.3).
In step 3, the loop operates on each of the sets of items resulting from the
split. It is more efficient to generate all these sets in a single step instead of
generating them separately. To keep the link between each value of the split
feature and the relative splitset a
Map
is used. A map stores associations
between a key (the possible value) and a value (the corresponding set of
items), making it very easy to obtain the set of items given the value.
To keep the method
buildDecisionTree
as close as possible to the original
algorithm (Algorithm 4.2), several details are missing:
how to determine the category of a set of items,
■
how to find the “best” split feature,
■
how to split a set of items,
■
how to calculate information content (in terms of number of bits) of a set
of items,
■
how to calculate the information gain deriving from a split.
■
The
findCategory()
method determines which is the category of a set of
items. This is not as easy as it seems, since the set can be empty or it can
contain items from several categories. If the set is empty then the category
is unknown, otherwise the frequency of the categories must be computed.