going respectively to trees ID3(A-{D}, P_v1), ID3(A-{D}, P_v2), . . . ID3(A-{D}, P_vm),
where P_vi denotes the partition of P whose records take the value vi on the splitting
attribute D.
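For illustration, the recursive construction just described (choose a splitting attribute D, partition the records P by the values of D, and recurse on A-{D} and each partition) can be sketched in Python roughly as follows; the record layout and helper names are assumptions made for the sketch, not the book's notation.

import math
from collections import Counter

def entropy(records, target):
    # Shannon entropy of the target label over a list of record dicts
    total = len(records)
    counts = Counter(r[target] for r in records)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain(records, attr, target):
    # information gain obtained by splitting the records on attr
    total = len(records)
    remainder = 0.0
    for value in {r[attr] for r in records}:
        subset = [r for r in records if r[attr] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(records, target) - remainder

def id3(attributes, records, target):
    # returns either a class label (leaf) or a pair (attribute, branches)
    labels = {r[target] for r in records}
    if len(labels) == 1:                      # pure partition: make a leaf
        return labels.pop()
    if not attributes:                        # nothing left to split on: majority label
        return Counter(r[target] for r in records).most_common(1)[0][0]
    d = max(attributes, key=lambda a: gain(records, a, target))
    branches = {}                             # one branch per value v of D,
    for v in {r[d] for r in records}:         # each built by ID3(A-{D}, P_v)
        p_v = [r for r in records if r[d] == v]
        branches[v] = id3(attributes - {d}, p_v, target)
    return (d, branches)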
C4.5
The C4.5 algorithm [8] introduces a number of improvements over the original
ID3 algorithm. The C4.5 algorithm can handle missing data. If the training records
contain unknown attribute values, C4.5 evaluates the gain for an attribute by
considering only the records where the attribute is defined.
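As a rough illustration of that idea, the following sketch (with an assumed dict-based record layout and a None marker for missing values) computes the gain for an attribute using only the records in which the attribute is defined.

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def gain_ignoring_missing(records, attr, target):
    # records with a missing (None) value for attr are left out of the
    # calculation, so the gain estimate is based on defined values only
    known = [r for r in records if r.get(attr) is not None]
    if not known:
        return 0.0
    base = entropy([r[target] for r in known])
    remainder = 0.0
    for value in {r[attr] for r in known}:
        subset = [r[target] for r in known if r[attr] == value]
        remainder += len(subset) / len(known) * entropy(subset)
    return base - remainder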
Both categorical and continuous attributes are supported by C4.5. Values of a
continuous attribute are sorted and partitioned at candidate split points. The gain
is calculated for each candidate partition, and the partition that maximizes the
gain is chosen for the next split.
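The sketch below illustrates the idea for a single continuous attribute: sort the values, try a binary split between each pair of adjacent distinct values, and keep the threshold with the highest information gain. It is a simplified illustration, not the exact procedure used by C4.5.

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def best_threshold(values, labels):
    # sort the (value, label) pairs, then evaluate a candidate binary split
    # between every pair of adjacent distinct values
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_t = 0.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                          # no boundary between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        g = base - (len(left) * entropy(left)
                    + len(right) * entropy(right)) / len(pairs)
        if g > best_gain:
            best_gain, best_t = g, t
    return best_t, best_gain

For example, best_threshold([48, 60, 75, 80, 90], ['no', 'no', 'yes', 'yes', 'no']) returns the midpoint 67.5 together with its gain of roughly 0.42, the best of the four candidate thresholds.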
The ID3 algorithm may construct a deep and complex tree, which can cause
overfitting (Section 7.1.4). The C4.5 algorithm addresses the overfitting problem
in ID3 by using a bottom-up technique called pruning to simplify the tree by
removing the least visited nodes and branches.
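A toy illustration of the bottom-up mechanics is sketched below: children are pruned first, and a subtree reached by too few training records is then collapsed into a leaf carrying the majority label. The tree representation and the record-count criterion are assumptions made for brevity, not C4.5's actual pruning rule.

def prune(tree, min_records=5):
    # a tree is either a class label (leaf) or a dict with keys
    # 'branches' (value -> subtree), 'records' (training records that
    # reached this node) and 'majority' (most common label at this node)
    if not isinstance(tree, dict):
        return tree                                   # already a leaf
    # bottom-up: simplify the children before deciding about this node
    tree['branches'] = {v: prune(sub, min_records)
                        for v, sub in tree['branches'].items()}
    if len(tree['records']) < min_records:
        return tree['majority']                       # collapse into a leaf
    return tree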
CART
CART (or Classification And Regression Trees) [9] is often used as a generic
acronym for decision trees, although it is a specific implementation.
Similar to C4.5, CART can handle continuous attributes. Whereas C4.5 uses
entropy-based criteria to rank tests, CART uses the Gini diversity index defined in
Equation 7.5:

Gini = 1 - Σ_j P(x_j)^2    (7.5)

where P(x_j) is the relative frequency of class x_j among the records.
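As a small worked example of Equation 7.5, the sketch below computes the Gini index from a list of class labels; a pure node scores 0, and an even two-class split scores 0.5.

from collections import Counter

def gini(labels):
    # Gini diversity index: 1 minus the sum of squared class proportions
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

print(gini(['yes'] * 10))              # 0.0  (pure node)
print(gini(['yes'] * 5 + ['no'] * 5))  # 0.5  (maximally mixed, two classes)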
Whereas C4.5 employs stopping rules, CART constructs a sequence of subtrees,
uses cross-validation to estimate the misclassification cost of each subtree, and
chooses the one with the lowest cost.
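A sketch of that procedure using scikit-learn's CART-style trees and a built-in dataset purely as a stand-in (both are assumptions for the example) is shown below: cost_complexity_pruning_path yields one alpha per subtree in the pruning sequence, cross-validation estimates each subtree's accuracy, and the subtree with the lowest estimated misclassification cost (highest accuracy) is kept.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# one effective alpha per subtree in the cost-complexity pruning sequence
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

best_alpha, best_score = 0.0, -np.inf
for alpha in np.clip(path.ccp_alphas, 0, None):   # guard against float round-off
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    score = cross_val_score(tree, X, y, cv=5).mean()   # estimated accuracy
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"chosen ccp_alpha={best_alpha:.4f}, cross-validated accuracy={best_score:.3f}")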
7.1.4 Evaluating a Decision Tree
Decision trees use greedy algorithms, in that they always choose the option
that seems the best available at that moment. At each step, the algorithm selects
which attribute to use for splitting the remaining records. This selection may not be
the best overall, but it is guaranteed to be the best at that step. This characteristic