In repetition, an attribute is repeatedly tested along a given branch of the tree (e.g., “age < 60?,” followed by “age < 45?,” and so on). In replication, duplicate subtrees exist within the tree. These situations can impede the accuracy and comprehensibility of a decision tree.
The use of multivariate splits (splits based on a combination of attributes) can prevent these problems. Another approach is to use a different form of knowledge representation, such as rules, instead of decision trees. This is described in Section 8.4.2, which shows how a rule-based classifier can be constructed by extracting IF-THEN rules from a decision tree.
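As a concrete illustration, the following Python sketch walks a fitted scikit-learn decision tree and prints one IF-THEN rule per root-to-leaf path. The toy data, feature names, and the choice of scikit-learn are assumptions of this example, not taken from the text.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy training data (illustrative; not the book's examples).
X = np.array([[25, 40], [30, 60], [45, 80], [50, 30], [35, 70], [60, 20]])
y = np.array(["no", "yes", "yes", "no", "yes", "no"])
feature_names = ["age", "income"]

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree = clf.tree_

def extract_rules(node=0, conditions=()):
    # Each root-to-leaf path becomes one IF-THEN rule.
    if tree.children_left[node] == -1:  # leaf: emit the accumulated rule
        label = clf.classes_[np.argmax(tree.value[node])]
        print("IF " + " AND ".join(conditions or ("TRUE",)) + f" THEN class = {label}")
        return
    name, thr = feature_names[tree.feature[node]], tree.threshold[node]
    extract_rules(tree.children_left[node], conditions + (f"{name} <= {thr:.1f}",))
    extract_rules(tree.children_right[node], conditions + (f"{name} > {thr:.1f}",))

extract_rules()

Each printed rule corresponds to one leaf, which matches the one-rule-per-path idea the text refers to.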
8.2.4 Scalability and Decision Tree Induction
“What if D, the disk-resident training set of class-labeled tuples, does not fit in memory? In
other words, how scalable is decision tree induction?” The efficiency of existing decision
tree algorithms, such as ID3, C4.5, and CART, has been well established for relatively
small data sets. Efficiency becomes an issue of concern when these algorithms are applied
to the mining of very large real-world databases. The pioneering decision tree algorithms that we have discussed so far have the restriction that the training tuples should reside in memory.
In data mining applications, very large training sets of millions of tuples are common. Most often, the training data will not fit in memory! Therefore, decision tree construction becomes inefficient due to the swapping of training tuples in and out of main and cache memories. More scalable approaches, capable of handling training data that are too large to fit in memory, are required. Earlier strategies to “save space” included discretizing continuous-valued attributes and sampling data at each node (see the sketch below). These techniques, however, still assume that the training set can fit in memory.
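The discretization idea can be made concrete with a small sketch. Here, a continuous attribute is replaced by a handful of equal-width bins; the data and the number of bins are illustrative assumptions, not values from the text.

import numpy as np

age = np.array([23, 31, 38, 44, 52, 60, 67, 29, 48, 55])  # continuous attribute
bins = np.linspace(age.min(), age.max(), num=4)            # 4 edges -> 3 equal-width bins
codes = np.digitize(age, bins[1:-1])                       # bin index 0, 1, or 2 per tuple
print(codes)  # each tuple now stores a small bin index instead of a raw value

Discretization shrinks the attribute's representation, but, as noted above, the discretized training set must still fit in memory.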
Several scalable decision tree induction methods have been introduced in recent studies. RainForest, for example, adapts to the amount of main memory available and applies to any decision tree induction algorithm. The method maintains an AVC-set (where “AVC” stands for “Attribute-Value, Classlabel”) for each attribute, at each tree node, describing the training tuples at the node. The AVC-set of an attribute A at node N gives the class label counts for each value of A for the tuples at N. Figure 8.8 shows AVC-sets for the tuple data of Table 8.1. The set of all AVC-sets at a node N is the AVC-group of N. The size of an AVC-set for attribute A at node N depends only on the number of distinct values of A and the number of classes in the set of tuples at N. Typically, this size should fit in memory, even for real-world data. RainForest also has techniques, however, for handling the case where the AVC-group does not fit in memory. Therefore, the method has high scalability for decision tree induction in very large data sets.
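The AVC-set is straightforward to picture in code. The following Python sketch counts class labels per attribute value for the tuples at a node; the attribute values echo the customer data of Table 8.1, but the tuples themselves are illustrative stand-ins, since the table is not reproduced here.

from collections import Counter, defaultdict

# (attribute dict, class label) pairs: the training tuples at node N.
tuples_at_N = [
    ({"age": "youth", "income": "high"}, "no"),
    ({"age": "youth", "income": "high"}, "no"),
    ({"age": "middle_aged", "income": "high"}, "yes"),
    ({"age": "senior", "income": "medium"}, "yes"),
    ({"age": "senior", "income": "low"}, "yes"),
]

def avc_set(tuples, attribute):
    # Class-label counts for each value of `attribute` at this node.
    counts = defaultdict(Counter)
    for attrs, label in tuples:
        counts[attrs[attribute]][label] += 1
    return dict(counts)

# The AVC-group of N is the collection of AVC-sets, one per attribute.
avc_group = {a: avc_set(tuples_at_N, a) for a in ("age", "income")}
print(avc_group["age"])
# {'youth': Counter({'no': 2}), 'middle_aged': Counter({'yes': 1}),
#  'senior': Counter({'yes': 2})}

Note that the size of each AVC-set is bounded by the number of distinct values of the attribute times the number of classes, regardless of how many tuples sit at the node, which is why the AVC-group typically fits in memory.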
BOAT (Bootstrapped Optimistic Algorithm for Tree construction) is a decision tree algorithm that takes a completely different approach to scalability: it is not based on the use of any special data structures. Instead, it uses a statistical technique known as “bootstrapping” (Section 8.5.4) to create several smaller samples (or subsets) of the given training data, each of which fits in memory. Each subset is used to construct a tree, resulting in several trees. The trees are examined and used to construct a new tree, T′, that turns out to be “very close” to the tree that would have been generated if all the original training data had fit in memory.
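The bootstrapping step can be sketched as follows. This only illustrates drawing several in-memory samples and building one tree per sample; it is not the full BOAT algorithm, whose optimistic tree construction and refinement against the complete data go beyond this sketch. The data and sample sizes are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 4))        # stand-in for a large training set
y = (X[:, 0] + X[:, 1] > 0).astype(int)

trees = []
for _ in range(5):                        # several bootstrap samples...
    idx = rng.choice(len(X), size=1_000)  # ...each small enough to fit in memory
    trees.append(DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[idx], y[idx]))

# Examine how stable the root split is across the sampled trees; BOAT relies
# on this kind of cross-sample agreement to arrive at a tree "very close"
# to the one the full data would produce.
for t in trees:
    print(t.tree_.feature[0], round(t.tree_.threshold[0], 2))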