11.8.1 Accelerating Tree Induction
Catlett (1991) has examined two methods for efficiently growing decision
trees from a large database by reducing the computational complexity
required for induction. However, Catlett's method requires that all the
data be loaded into main memory before induction; consequently, the
largest dataset that can be induced is bounded by the memory size. Fifield
(1992) suggests a parallel implementation of the ID3 algorithm. However,
like Catlett's method, it assumes that the entire dataset can fit in main memory.
Chan and Stolfo (1997) suggest partitioning the dataset into several
disjoint subsets, such that each subset is loaded separately into
memory and used to induce a decision tree. The decision trees are then
combined to create a single classifier. However, their experimental results
indicate that partitioning may reduce the classification performance, meaning
that the classification accuracy of the combined decision trees is not as
good as the accuracy of a single decision tree induced from the entire
dataset.
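To make the idea concrete, the following sketch partitions a dataset, grows one tree per partition, and combines the trees by majority vote. It assumes scikit-learn and integer-coded class labels; the voting rule is a simple illustrative combination strategy, not necessarily the meta-learning scheme Chan and Stolfo actually used.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_partitioned_trees(X, y, n_partitions):
    # Split the records into disjoint partitions, each small enough to
    # fit in main memory, and grow one tree per partition.
    parts = np.array_split(np.random.permutation(len(X)), n_partitions)
    return [DecisionTreeClassifier().fit(X[idx], y[idx]) for idx in parts]

def combined_predict(trees, X):
    # Combine the partition trees into a single classifier by majority
    # vote over their individual predictions.
    votes = np.stack([t.predict(X) for t in trees])  # (n_trees, n_samples)
    return np.apply_along_axis(
        lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)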
There are several decision tree algorithms that implement disk-based
approaches, removing the requirement of loading the entire dataset into
main memory. These algorithms optimize the number of reads and writes
to secondary storage during tree construction.
The SLIQ algorithm [Mehta et al. (1996)] does not require loading the
entire dataset into main memory; instead, it uses secondary storage
(disk), so a given instance is not necessarily resident in main
memory at all times. SLIQ creates a single decision tree from the entire
dataset. However, this method also has an upper limit on the largest dataset
that can be processed, because it uses a data structure that scales with
the dataset size and must be resident in main memory at all times.
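The memory-resident structure in question is SLIQ's class list: one entry per training record, holding its class label and the tree leaf it currently falls into, while the pre-sorted per-attribute lists stay on disk. The sketch below is illustrative (the names are ours, not Mehta et al.'s code); it shows why this structure, and hence SLIQ's capacity, grows linearly with the number of records.

from dataclasses import dataclass

@dataclass
class ClassListEntry:
    label: int  # class label of the record
    leaf: int   # id of the tree leaf the record currently belongs to

def build_class_list(labels):
    # One entry per training record: O(n) main memory, regardless of
    # how many attribute lists are streamed from disk during splitting.
    return [ClassListEntry(label=y, leaf=0) for y in labels]

class_list = build_class_list([0, 1, 1, 0, 1])
print(len(class_list))  # grows with the number of records: 5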
The SPRINT algorithm uses a similar approach [Shafer et al. (1996)].
This algorithm induces decision trees relatively quickly and removes all
memory restrictions from decision tree induction. SPRINT can scale any
impurity-based split criterion to large datasets. Gehrke et al. (2000) intro-
duced RainForest, a unifying framework for decision tree classifiers that is
capable of scaling any specific algorithm from the literature (including C4.5,
CART and CHAID). In addition to its generality, RainForest improves
on SPRINT by a factor of three. In contrast to SPRINT, however, RainForest
requires a certain minimum amount of main memory, proportional to the
number of distinct values in a column of the input relation. This
requirement, however, is considered modest and reasonable.
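The structure behind this requirement is RainForest's AVC-set (Attribute-Value, Class label): per-attribute counts of (value, class) pairs, built in one sequential scan. The minimal sketch below, assuming records are (feature tuple, label) pairs, shows why memory grows with the number of distinct attribute values rather than with the number of records.

from collections import Counter

def build_avc_set(records, attr_index):
    # Aggregate (attribute value, class label) counts in one sequential
    # pass; the records themselves can be streamed from disk.
    avc = Counter()
    for features, label in records:
        avc[(features[attr_index], label)] += 1
    return avc  # size: O(distinct values x classes), not O(records)

# Example: three records, attribute 0 takes only two distinct values.
records = [((1, 5.0), 'a'), ((1, 3.0), 'b'), ((2, 4.0), 'a')]
print(build_avc_set(records, 0))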