5.3.2 Theory
Decision trees are nonparametric statistical techniques used for
classification problems (Breiman et al., 1984). They are generated by
recursively partitioning the data set, using a splitting attribute, until all
the records in a partition belong to the same class (Chandra and
Varghese, 2009). Recursive partitioning is a method for determining
statistically meaningful rules that classify objects into similar categories
(Rusinko et al., 1999).
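To make this concrete, the following is a minimal Python sketch of recursive partitioning. The record layout (plain dicts carrying a "class" label) and the naive first-attribute split are illustrative assumptions only; the algorithms described below replace that naive choice with a statistical splitting criterion.

    from collections import Counter

    def partition(records, attributes):
        """Recursively split records until each partition is pure (one class)."""
        labels = [r["class"] for r in records]
        if len(set(labels)) == 1 or not attributes:
            # Leaf: the partition is pure, or no attributes remain
            # (fall back to the majority class).
            return Counter(labels).most_common(1)[0][0]
        attr, rest = attributes[0], attributes[1:]  # naive split choice
        subsets = {}
        for r in records:
            subsets.setdefault(r[attr], []).append(r)
        return {attr: {value: partition(subset, rest)
                       for value, subset in subsets.items()}}

For example, partition(records, ["outlook", "windy"]) on a small, hypothetical weather data set returns a nested dictionary such as {'outlook': {'sunny': {'windy': {'no': 'play', 'yes': 'stay'}}, 'rainy': 'stay'}}, in which each inner dictionary is a decision node and each string is a leaf.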
Being analogous to naturally occurring trees, decision trees consist of
branches and nodes. Nodes are either decision-making or terminating.
Terminating nodes, or tree leaves, represent specific classes of the data.
Decision nodes define a test to be conducted on one of the data attributes
in order to split the data into subsets. The tree starts from the root and
the branches lead to the subsequent nodes. The decision tree can be
interpreted as a rule induction technique: each path from the root to a
leaf corresponds to an if-then rule, so different scenarios can be read
from the tree structure. In this respect, rule induction is similar to the
previously described fuzzy logic.
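Under the dict-based tree sketched above, this rule extraction amounts to enumerating root-to-leaf paths; the rule format below is arbitrary and purely illustrative.

    def tree_to_rules(tree, conditions=()):
        """Enumerate root-to-leaf paths of the nested-dict tree as if-then rules."""
        if not isinstance(tree, dict):            # leaf: emit the accumulated rule
            yield (" AND ".join(conditions) or "TRUE", tree)
            return
        (attr, branches), = tree.items()          # one attribute test per decision node
        for value, subtree in branches.items():
            yield from tree_to_rules(subtree, conditions + (f"{attr} = {value}",))

    # for condition, label in tree_to_rules(tree):
    #     print(f"IF {condition} THEN class = {label}")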
Different algorithms are used for decision tree construction: ID3, C4.5,
C5.0, Classification and Regression Trees (CART), Chi-squared
Automatic Interaction Detector (CHAID), etc. The most popular
algorithms, ID3, C4.5, and C5.0, were developed by Quinlan (1986,
1993). The ID3 algorithm was the first and was based on the Concept
Learning System algorithm. The ID3 algorithm searches through the
attributes of the data set and extracts the one attribute that best separates
the given samples. It then searches through the subsets created to find
their best attribute for further partitioning of the data set. There is no
backtracking once the subsets are created, so the same attribute cannot
be selected twice during tree construction. When an attribute is found to
perfectly classify the samples, the ID3 algorithm stops. One shortcoming
of the ID3 algorithm is that it deals only with discrete values of data
samples. Discretization of continuous values is therefore necessary for
successful classification, and different discretization techniques are
available.
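The "best separation" in ID3 is measured by information gain, that is, the reduction in entropy achieved by a split (Quinlan, 1986). A minimal sketch, again assuming the dict-based records used above:

    from collections import Counter
    from math import log2

    def entropy(labels):
        """Shannon entropy (in bits) of a list of class labels."""
        n = len(labels)
        return -sum(c / n * log2(c / n) for c in Counter(labels).values())

    def information_gain(records, attr):
        """Entropy reduction obtained by splitting the records on attr."""
        subsets = {}
        for r in records:
            subsets.setdefault(r[attr], []).append(r["class"])
        split_entropy = sum(len(s) / len(records) * entropy(s)
                            for s in subsets.values())
        return entropy([r["class"] for r in records]) - split_entropy

    # At every node, ID3 selects the attribute with the highest gain:
    # best = max(attributes, key=lambda a: information_gain(records, a))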
The C4.5 algorithm is an improved version of the ID3 algorithm in
that it can also analyze continuous attributes. The algorithm splits
continuous samples into two sets, smaller and greater than a threshold
value (Quinlan, 1996); a sketch of this threshold search is given below.
It also handles data with missing attribute values and prunes trees after
creation. In order to limit excessive partitioning of the data, tree depth
(i.e. branching) can be controlled. This is done by so-called tree pruning,
which is the process of removing branches that contribute little to
classification accuracy, thereby reducing overfitting.
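The binary threshold search can be sketched as follows, reusing the entropy helper above. Evaluating candidate cut points at midpoints between consecutive observed values is an illustrative simplification, not a claim about Quinlan's exact implementation.

    def best_threshold(records, attr):
        """Find the most informative binary cut point for a continuous attribute."""
        values = sorted({r[attr] for r in records})
        labels = [r["class"] for r in records]
        best_gain, best_t = float("-inf"), None
        for lo, hi in zip(values, values[1:]):    # consecutive distinct values
            t = (lo + hi) / 2                     # candidate threshold (midpoint)
            left = [r["class"] for r in records if r[attr] <= t]
            right = [r["class"] for r in records if r[attr] > t]
            n = len(records)
            gain = entropy(labels) - (len(left) / n * entropy(left)
                                      + len(right) / n * entropy(right))
            if gain > best_gain:
                best_gain, best_t = gain, t
        return best_t                             # None if attr has one distinct value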