going respectively to trees ID3(A-{D}, P_v1), ID3(A-{D}, P_v2), . . . ID3(A-{D}, P_vm),
where P_vi denotes the partition of P whose records take the value vi on the splitting
attribute D.
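For illustration, the recursive construction just described (choose a splitting attribute D, partition the records P by the values of D, and recurse on A-{D} and each partition) can be sketched in Python roughly as follows; the record layout and helper names are assumptions made for the sketch, not the book's notation.

import math
from collections import Counter

def entropy(records, target):
    # Shannon entropy of the target label over a list of record dicts
    total = len(records)
    counts = Counter(r[target] for r in records)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain(records, attr, target):
    # information gain obtained by splitting the records on attr
    total = len(records)
    remainder = 0.0
    for value in {r[attr] for r in records}:
        subset = [r for r in records if r[attr] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(records, target) - remainder

def id3(attributes, records, target):
    # returns either a class label (leaf) or a pair (attribute, branches)
    labels = {r[target] for r in records}
    if len(labels) == 1:                      # pure partition: make a leaf
        return labels.pop()
    if not attributes:                        # nothing left to split on: majority label
        return Counter(r[target] for r in records).most_common(1)[0][0]
    d = max(attributes, key=lambda a: gain(records, a, target))
    branches = {}                             # one branch per value v of D,
    for v in {r[d] for r in records}:         # each built by ID3(A-{D}, P_v)
        p_v = [r for r in records if r[d] == v]
        branches[v] = id3(attributes - {d}, p_v, target)
    return (d, branches)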
C4.5
The C4.5 algorithm [8] introduces a number of improvements over the original
ID3 algorithm. The C4.5 algorithm can handle missing data. If the training records
contain unknown attribute values, C4.5 evaluates the gain for an attribute by
considering only the records where the attribute is defined.
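As a rough illustration of that idea, the following sketch (with an assumed dict-based record layout and a None marker for missing values) computes the gain for an attribute using only the records in which the attribute is defined.

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def gain_ignoring_missing(records, attr, target):
    # records with a missing (None) value for attr are left out of the
    # calculation, so the gain estimate is based on defined values only
    known = [r for r in records if r.get(attr) is not None]
    if not known:
        return 0.0
    base = entropy([r[target] for r in known])
    remainder = 0.0
    for value in {r[attr] for r in known}:
        subset = [r[target] for r in known if r[attr] == value]
        remainder += len(subset) / len(known) * entropy(subset)
    return base - remainder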
Both categorical and continuous attributes are supported by C4.5. Values of a
continuous attribute are sorted and partitioned at candidate split points. The gain
is calculated for each candidate partition, and the partition that maximizes the
gain is chosen for the next split.
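The sketch below illustrates the idea for a single continuous attribute: sort the values, try a binary split between each pair of adjacent distinct values, and keep the threshold with the highest information gain. It is a simplified illustration, not the exact procedure used by C4.5.

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def best_threshold(values, labels):
    # sort the (value, label) pairs, then evaluate a candidate binary split
    # between every pair of adjacent distinct values
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_t = 0.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                          # no boundary between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        g = base - (len(left) * entropy(left)
                    + len(right) * entropy(right)) / len(pairs)
        if g > best_gain:
            best_gain, best_t = g, t
    return best_t, best_gain

For example, best_threshold([48, 60, 75, 80, 90], ['no', 'no', 'yes', 'yes', 'no']) returns the midpoint 67.5 together with its gain of roughly 0.42, the best of the four candidate thresholds.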
The ID3 algorithm may construct a deep and complex tree, which can cause
overfitting (Section 7.1.4). The C4.5 algorithm addresses the overfitting problem
in ID3 by using a bottom-up technique called pruning to simplify the tree by
removing the least visited nodes and branches.
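A toy illustration of the bottom-up mechanics is sketched below: children are pruned first, and a subtree reached by too few training records is then collapsed into a leaf carrying the majority label. The tree representation and the record-count criterion are assumptions made for brevity, not C4.5's actual pruning rule.

def prune(tree, min_records=5):
    # a tree is either a class label (leaf) or a dict with keys
    # 'branches' (value -> subtree), 'records' (training records that
    # reached this node) and 'majority' (most common label at this node)
    if not isinstance(tree, dict):
        return tree                                   # already a leaf
    # bottom-up: simplify the children before deciding about this node
    tree['branches'] = {v: prune(sub, min_records)
                        for v, sub in tree['branches'].items()}
    if len(tree['records']) < min_records:
        return tree['majority']                       # collapse into a leaf
    return tree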
CART
CART (or Classification And Regression Trees) [9] is often used as a generic
acronym for decision trees, although it is a specific implementation.
Similar to C4.5, CART can handle continuous attributes. Whereas C4.5 uses
entropy-based criteria to rank tests, CART uses the Gini diversity index defined in
Equation 7.5:

Gini = 1 - Σ_j P(x_j)^2    (7.5)

where P(x_j) is the relative frequency of class x_j among the records.
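As a small worked example of Equation 7.5, the sketch below computes the Gini index from a list of class labels; a pure node scores 0, and an even two-class split scores 0.5.

from collections import Counter

def gini(labels):
    # Gini diversity index: 1 minus the sum of squared class proportions
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

print(gini(['yes'] * 10))              # 0.0  (pure node)
print(gini(['yes'] * 5 + ['no'] * 5))  # 0.5  (maximally mixed, two classes)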
Whereas C4.5 employs stopping rules, CART constructs a sequence of subtrees,
uses cross-validation to estimate the misclassification cost of each subtree, and
chooses the one with the lowest cost.
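A sketch of that procedure using scikit-learn's CART-style trees and a built-in dataset purely as a stand-in (both are assumptions for the example) is shown below: cost_complexity_pruning_path yields one alpha per subtree in the pruning sequence, cross-validation estimates each subtree's accuracy, and the subtree with the lowest estimated misclassification cost (highest accuracy) is kept.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# one effective alpha per subtree in the cost-complexity pruning sequence
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

best_alpha, best_score = 0.0, -np.inf
for alpha in np.clip(path.ccp_alphas, 0, None):   # guard against float round-off
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    score = cross_val_score(tree, X, y, cv=5).mean()   # estimated accuracy
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"chosen ccp_alpha={best_alpha:.4f}, cross-validated accuracy={best_score:.3f}")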
7.1.4 Evaluating a Decision Tree
Decision trees use greedy algorithms, in that they always choose the option
that seems the best available at that moment. At each step, the algorithm selects
which attribute to use for splitting the remaining records. This selection may not be
the best overall, but it is guaranteed to be the best at that step. This characteristic