11.8.1 Accelerating Tree Induction
Catlett (1991) has examined two methods for efficiently growing decision
trees from a large database by reducing the computational complexity
required for induction. However, Catlett's method requires that all the
data be loaded into main memory before induction; consequently, the
largest dataset that can be induced is bounded by the memory size. Fifield
(1992) suggests a parallel implementation of the ID3 algorithm. However,
like Catlett's method, it assumes that the entire dataset can fit in main memory.
Chan and Stolfo (1997) suggest partitioning the dataset into several
disjoint subsets, such that each subset is loaded separately into
memory and used to induce a decision tree. The decision trees are then
combined to create a single classifier. However, their experimental results
indicate that partitioning may reduce the classification performance, meaning
that the classification accuracy of the combined decision trees is not as
good as the accuracy of a single decision tree induced from the entire
dataset.
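To make the idea concrete, the following sketch partitions a dataset, grows one tree per partition, and combines the trees by majority vote. It assumes scikit-learn and integer-coded class labels; the voting rule is a simple illustrative combination strategy, not necessarily the meta-learning scheme Chan and Stolfo actually used.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_partitioned_trees(X, y, n_partitions):
    # Split the records into disjoint partitions, each small enough to
    # fit in main memory, and grow one tree per partition.
    parts = np.array_split(np.random.permutation(len(X)), n_partitions)
    return [DecisionTreeClassifier().fit(X[idx], y[idx]) for idx in parts]

def combined_predict(trees, X):
    # Combine the partition trees into a single classifier by majority
    # vote over their individual predictions.
    votes = np.stack([t.predict(X) for t in trees])  # (n_trees, n_samples)
    return np.apply_along_axis(
        lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)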
There are several decision tree algorithms that implement disk-based
approaches, removing the requirement of loading the entire dataset into
main memory. These algorithms optimize the number of reads and writes
to secondary storage during tree construction.
The SLIQ algorithm [Mehta et al. (1996)] does not require loading the
entire dataset into main memory; instead, it uses secondary storage
(disk), so a given instance is not necessarily resident in main
memory at all times. SLIQ creates a single decision tree from the entire
dataset. However, this method also has an upper limit on the largest dataset
that can be processed, because it uses a data structure that scales with
the dataset size and must be resident in main memory at all times.
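The memory-resident structure in question is SLIQ's class list: one entry per training record, holding its class label and the tree leaf it currently falls into, while the pre-sorted per-attribute lists stay on disk. The sketch below is illustrative (the names are ours, not Mehta et al.'s code); it shows why this structure, and hence SLIQ's capacity, grows linearly with the number of records.

from dataclasses import dataclass

@dataclass
class ClassListEntry:
    label: int  # class label of the record
    leaf: int   # id of the tree leaf the record currently belongs to

def build_class_list(labels):
    # One entry per training record: O(n) main memory, regardless of
    # how many attribute lists are streamed from disk during splitting.
    return [ClassListEntry(label=y, leaf=0) for y in labels]

class_list = build_class_list([0, 1, 1, 0, 1])
print(len(class_list))  # grows with the number of records: 5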
The SPRINT algorithm uses a similar approach [Shafer et al. (1996)].
This algorithm induces decision trees relatively quickly and removes all
memory restrictions from decision tree induction. SPRINT can scale any
impurity-based split criterion to large datasets. Gehrke et al. (2000) intro-
duced RainForest, a unifying framework for decision tree classifiers that is
capable of scaling any specific algorithm from the literature (including C4.5,
CART and CHAID). In addition to its generality, RainForest improves
on SPRINT by a factor of three. In contrast to SPRINT, however, RainForest
requires a certain minimum amount of main memory, proportional to the
number of distinct values in a column of the input relation. This
requirement, however, is considered modest and reasonable.
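The structure behind this requirement is RainForest's AVC-set (Attribute-Value, Class label): per-attribute counts of (value, class) pairs, built in one sequential scan. The minimal sketch below, assuming records are (feature tuple, label) pairs, shows why memory grows with the number of distinct attribute values rather than with the number of records.

from collections import Counter

def build_avc_set(records, attr_index):
    # Aggregate (attribute value, class label) counts in one sequential
    # pass; the records themselves can be streamed from disk.
    avc = Counter()
    for features, label in records:
        avc[(features[attr_index], label)] += 1
    return avc  # size: O(distinct values x classes), not O(records)

# Example: three records, attribute 0 takes only two distinct values.
records = [((1, 5.0), 'a'), ((1, 3.0), 'b'), ((2, 4.0), 'a')]
print(build_avc_set(records, 0))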