In repetition, an attribute is repeatedly tested along a given branch of the tree (e.g., “age < 60?,” followed by “age < 45?,” and so on). In replication, duplicate subtrees exist within the tree. These situations can impede the accuracy and comprehensibility of a decision tree.
The use of multivariate splits (splits based on a combination of attributes) can prevent these problems. Another approach is to use a different form of knowledge representation, such as rules, instead of decision trees. This is described in Section 8.4.2, which shows how a rule-based classifier can be constructed by extracting IF-THEN rules from a decision tree.
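As a concrete illustration, the following Python sketch walks a fitted scikit-learn decision tree and prints one IF-THEN rule per root-to-leaf path. The toy data, feature names, and the choice of scikit-learn are assumptions of this example, not taken from the text.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy training data (illustrative; not the book's examples).
X = np.array([[25, 40], [30, 60], [45, 80], [50, 30], [35, 70], [60, 20]])
y = np.array(["no", "yes", "yes", "no", "yes", "no"])
feature_names = ["age", "income"]

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree = clf.tree_

def extract_rules(node=0, conditions=()):
    # Each root-to-leaf path becomes one IF-THEN rule.
    if tree.children_left[node] == -1:  # leaf: emit the accumulated rule
        label = clf.classes_[np.argmax(tree.value[node])]
        print("IF " + " AND ".join(conditions or ("TRUE",)) + f" THEN class = {label}")
        return
    name, thr = feature_names[tree.feature[node]], tree.threshold[node]
    extract_rules(tree.children_left[node], conditions + (f"{name} <= {thr:.1f}",))
    extract_rules(tree.children_right[node], conditions + (f"{name} > {thr:.1f}",))

extract_rules()

Each printed rule corresponds to one leaf, which matches the one-rule-per-path idea the text refers to.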
8.2.4 Scalability and Decision Tree Induction
“What if D, the disk-resident training set of class-labeled tuples, does not fit in memory? In
other words, how scalable is decision tree induction?” The efficiency of existing decision
tree algorithms, such as ID3, C4.5, and CART, has been well established for relatively
small data sets. Efficiency becomes an issue of concern when these algorithms are applied
to the mining of very large real-world databases. The pioneering decision tree algorithms that we have discussed so far have the restriction that the training tuples should reside in memory.
In data mining applications, very large training sets of millions of tuples are common. Most often, the training data will not fit in memory! Therefore, decision tree construction becomes inefficient due to the swapping of training tuples in and out of main and cache memories. More scalable approaches, capable of handling training data that are too large to fit in memory, are required. Earlier strategies to “save space” included discretizing continuous-valued attributes and sampling data at each node (see the sketch below). These techniques, however, still assume that the training set can fit in memory.
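The discretization idea can be made concrete with a small sketch. Here, a continuous attribute is replaced by a handful of equal-width bins; the data and the number of bins are illustrative assumptions, not values from the text.

import numpy as np

age = np.array([23, 31, 38, 44, 52, 60, 67, 29, 48, 55])  # continuous attribute
bins = np.linspace(age.min(), age.max(), num=4)            # 4 edges -> 3 equal-width bins
codes = np.digitize(age, bins[1:-1])                       # bin index 0, 1, or 2 per tuple
print(codes)  # each tuple now stores a small bin index instead of a raw value

Discretization shrinks the attribute's representation, but, as noted above, the discretized training set must still fit in memory.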
Several scalable decision tree induction methods have been introduced in recent studies. RainForest, for example, adapts to the amount of main memory available and applies to any decision tree induction algorithm. The method maintains an AVC-set (where “AVC” stands for “Attribute-Value, Classlabel”) for each attribute, at each tree node, describing the training tuples at the node. The AVC-set of an attribute A at node N gives the class label counts for each value of A for the tuples at N. Figure 8.8 shows AVC-sets for the tuple data of Table 8.1. The set of all AVC-sets at a node N is the AVC-group of N. The size of an AVC-set for attribute A at node N depends only on the number of distinct values of A and the number of classes in the set of tuples at N. Typically, this size should fit in memory, even for real-world data. RainForest also has techniques, however, for handling the case where the AVC-group does not fit in memory. Therefore, the method has high scalability for decision tree induction in very large data sets.
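The AVC-set is straightforward to picture in code. The following Python sketch counts class labels per attribute value for the tuples at a node; the attribute values echo the customer data of Table 8.1, but the tuples themselves are illustrative stand-ins, since the table is not reproduced here.

from collections import Counter, defaultdict

# (attribute dict, class label) pairs: the training tuples at node N.
tuples_at_N = [
    ({"age": "youth", "income": "high"}, "no"),
    ({"age": "youth", "income": "high"}, "no"),
    ({"age": "middle_aged", "income": "high"}, "yes"),
    ({"age": "senior", "income": "medium"}, "yes"),
    ({"age": "senior", "income": "low"}, "yes"),
]

def avc_set(tuples, attribute):
    # Class-label counts for each value of `attribute` at this node.
    counts = defaultdict(Counter)
    for attrs, label in tuples:
        counts[attrs[attribute]][label] += 1
    return dict(counts)

# The AVC-group of N is the collection of AVC-sets, one per attribute.
avc_group = {a: avc_set(tuples_at_N, a) for a in ("age", "income")}
print(avc_group["age"])
# {'youth': Counter({'no': 2}), 'middle_aged': Counter({'yes': 1}),
#  'senior': Counter({'yes': 2})}

Note that the size of each AVC-set is bounded by the number of distinct values of the attribute times the number of classes, regardless of how many tuples sit at the node, which is why the AVC-group typically fits in memory.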
BOAT (Bootstrapped Optimistic Algorithm for Tree construction) is a decision tree algorithm that takes a completely different approach to scalability: it is not based on the use of any special data structures. Instead, it uses a statistical technique known as “bootstrapping” (Section 8.5.4) to create several smaller samples (or subsets) of the given training data, each of which fits in memory. Each subset is used to construct a tree, resulting in several trees. The trees are examined and used to construct a new tree, T′, that turns out to be “very close” to the tree that would have been generated if all the original training data had fit in memory.
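The bootstrapping step can be sketched as follows. This only illustrates drawing several in-memory samples and building one tree per sample; it is not the full BOAT algorithm, whose optimistic tree construction and refinement against the complete data go beyond this sketch. The data and sample sizes are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 4))        # stand-in for a large training set
y = (X[:, 0] + X[:, 1] > 0).astype(int)

trees = []
for _ in range(5):                        # several bootstrap samples...
    idx = rng.choice(len(X), size=1_000)  # ...each small enough to fit in memory
    trees.append(DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[idx], y[idx]))

# Examine how stable the root split is across the sampled trees; BOAT relies
# on this kind of cross-sample agreement to arrive at a tree "very close"
# to the one the full data would produce.
for t in trees:
    print(t.tree_.feature[0], round(t.tree_.threshold[0], 2))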