The process of selecting a new attribute and partitioning the samples is repeated for each node until all attributes have been used or until the samples associated with a specific node all have the same value for the target attribute (i.e. their entropy is 0). The ID3 and C4.5 algorithms (and some of their improved versions) use information-based splitting criteria. The gain ratio, used by C4.5, is a modification of the information gain that takes into account the intrinsic information of a split, that is, it takes the number and size of branches into account when choosing an attribute (Abraham et al., 2009).
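As a rough illustration of these criteria, the Python sketch below computes entropy, information gain, and gain ratio for a hypothetical two-class node that is split into two branches; the class labels and the split itself are invented for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, branches):
    """Reduction in entropy achieved by splitting `parent` into `branches`."""
    n = len(parent)
    remainder = sum(len(b) / n * entropy(b) for b in branches)
    return entropy(parent) - remainder

def gain_ratio(parent, branches):
    """Information gain normalised by the intrinsic information of the split,
    which penalises attributes that create many small branches."""
    n = len(parent)
    intrinsic = -sum((len(b) / n) * math.log2(len(b) / n) for b in branches)
    return information_gain(parent, branches) / intrinsic if intrinsic else 0.0

# Hypothetical node of 10 samples split into two branches by some attribute.
parent = ["active"] * 5 + ["inactive"] * 5
branches = [["active"] * 4 + ["inactive"], ["active"] + ["inactive"] * 4]
print(round(information_gain(parent, branches), 3))  # ~0.278 bits
print(round(gain_ratio(parent, branches), 3))        # ~0.278 (intrinsic info of this split is 1 bit)
```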
The Gini index (Gini coefficient, Gini impurity) is used to identify the attribute, and the cut point for that attribute, that minimize the variance or diversity in each of the two subsets that emerge from the split (Breiman et al., 1984):
\[ \mathrm{Gini} = 1 - \sum_{i=1}^{c} p_i^{2} \]   [5.20]

where p_i is the probability that a sample belongs to class i.
The Gini index always lies between 0 and 1, regardless of the number of classes, c, in the data set. When all samples belong to the same class, the Gini index has a value of 0, whereas it reaches its maximum value of 1 − 1/c when a sample is equally likely to belong to any of the classes. The CART algorithm uses the Gini index as its splitting criterion.
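A minimal Python sketch of this measure is given below; the class labels are hypothetical, and the helper `gini_of_split` merely illustrates how a CART-style procedure would compare candidate splits by their size-weighted impurity.

```python
from collections import Counter

def gini_index(labels):
    """Gini impurity of a node: 1 minus the sum of squared class probabilities (Eq. 5.20)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_of_split(groups):
    """Size-weighted Gini impurity of the subsets produced by a candidate split;
    the split with the lowest value is preferred."""
    n = sum(len(g) for g in groups)
    return sum(len(g) / n * gini_index(g) for g in groups)

print(gini_index(["a"] * 10))                 # 0.0 -> pure node
print(gini_index(["a"] * 5 + ["b"] * 5))      # 0.5 -> maximum (1 - 1/c) for c = 2 classes
print(gini_of_split([["a"] * 5, ["b"] * 5]))  # 0.0 -> a perfect split
```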
In the case of simultaneous multiple-class classification, random forests can be developed, in which each tree in the forest is grown from a bootstrapped sample of the data set (a sketch is given below). GAs have been used to avoid locally optimal decisions and to search the decision tree space with little a priori bias (Papagelis and Kalles, 2001).
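The sketch below illustrates the bootstrapping idea with scikit-learn's RandomForestClassifier, assuming that library is available; the synthetic data set and all parameter values are purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic three-class data set standing in for a real descriptor matrix.
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# Each of the 100 trees is grown on a bootstrapped sample of the training data,
# and a random subset of attributes is considered at every split.
forest = RandomForestClassifier(n_estimators=100, bootstrap=True,
                                max_features="sqrt", random_state=0)
forest.fit(X, y)
print(forest.predict(X[:5]))  # class predictions by majority vote of the trees
```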
The building of a decision tree is a supervised process, which means that all data attributes and classes are defined before tree construction begins. Also, each sample of the data set must belong to a certain class, and there should be more samples than classes (in order to avoid classes containing only one sample). A data set with a large number of samples is recommended for tree construction. The parameters that the user needs to supply in order to build a decision tree depend on the software used. Most programs allow automatic selection of predefined parameters, but users are sometimes encouraged to experiment with different algorithms, splitting criteria, pruning parameters, minimum numbers of samples per class, thresholds, etc., as illustrated in the sketch below.
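By way of illustration, the sketch below shows how such parameters might be set with scikit-learn's DecisionTreeClassifier, assuming that implementation is used; the values shown are arbitrary examples rather than recommendations.

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    criterion="gini",      # splitting criterion ("entropy" selects an information-based split)
    max_depth=5,           # limits tree depth (a simple form of pre-pruning)
    min_samples_leaf=5,    # minimum number of samples allowed in a leaf
    ccp_alpha=0.01,        # cost-complexity (post-)pruning threshold
    random_state=0,
)
```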
Also, once the decision tree is built, it is important to evaluate its predictive ability. This is usually done with a cross-validation approach.
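A minimal sketch of such an evaluation, assuming scikit-learn and using its built-in iris data purely as a stand-in for a real compound data set, is:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 10-fold cross-validation; each fold's score is the fraction of correctly
# classified samples, i.e. the accuracy defined below.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                         cv=10, scoring="accuracy")
print(scores.mean())
```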
Accuracy is the overall classification criterion of a prediction model; it corresponds to the ratio of correctly classified compounds to the total number of compounds (Choi et al., 2009). Since most decision trees are used for