Scalability issues for decision tree induction from large databases are discussed in Section 8.2.4. Section 8.2.5 presents a visual mining
approach to decision tree induction.
8.2.1 Decision Tree Induction
During the late 1970s and early 1980s, J. Ross Quinlan, a researcher in machine learning,
developed a decision tree algorithm known as ID3 (Iterative Dichotomiser). This work
expanded on earlier work on concept learning systems, described by E. B. Hunt, J. Marin,
and P. T. Stone. Quinlan later presented C4.5 (a successor of ID3), which became a
benchmark to which newer supervised learning algorithms are often compared. In 1984,
a group of statisticians (L. Breiman, J. Friedman, R. Olshen, and C. Stone) published
the book Classification and Regression Trees (CART), which described the generation of
binary decision trees. ID3 and CART were invented independently of one another at
around the same time, yet follow a similar approach for learning decision trees from
training tuples. These two cornerstone algorithms spawned a flurry of work on decision
tree induction.
ID3, C4.5, and CART adopt a greedy (i.e., nonbacktracking) approach in which decision
trees are constructed in a top-down recursive divide-and-conquer manner. Most
algorithms for decision tree induction also follow such a top-down approach, which starts
with a training set of tuples and their associated class labels. The training set is
recursively partitioned into smaller subsets as the tree is being built. A basic decision tree
algorithm is summarized in Figure 8.3. At first glance, the algorithm may appear long,
but fear not! It is quite straightforward. The strategy is as follows.
The algorithm is called with three parameters: D, attribute list, and Attribute
selection method. We refer to D as a data partition. Initially, it is the complete set
of training tuples and their associated class labels. The parameter attribute list is a
list of attributes describing the tuples. Attribute selection method specifies a heuristic
procedure for selecting the attribute that “best” discriminates the given tuples
according to class. This procedure employs an attribute selection measure such as
information gain or the Gini index. Whether the tree is strictly binary is generally
driven by the attribute selection measure. Some attribute selection measures, such as
the Gini index, enforce the resulting tree to be binary. Others, like information gain,
do not, therein allowing multiway splits (i.e., two or more branches to be grown from
a node).
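
To make the interplay of these three parameters concrete, the following Python sketch mirrors the recursive divide-and-conquer strategy just described. It is only a minimal illustration, not the pseudocode of Figure 8.3: it assumes discrete-valued attributes identified by column index, with the class label in the last column, and uses information gain as the attribute selection measure. The helper names entropy, information_gain, and generate_decision_tree are our own, not taken from the text.

from collections import Counter
from math import log2

def entropy(D, class_index=-1):
    """Expected information needed to classify a tuple in partition D."""
    counts = Counter(row[class_index] for row in D)
    total = len(D)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(D, attr):
    """Reduction in entropy obtained by a multiway split of D on attribute attr."""
    total = len(D)
    expected = 0.0
    for value in {row[attr] for row in D}:
        Dj = [row for row in D if row[attr] == value]
        expected += (len(Dj) / total) * entropy(Dj)
    return entropy(D) - expected

def generate_decision_tree(D, attribute_list, attribute_selection_method=information_gain):
    """Grow a decision tree (as nested dicts) from the data partition D."""
    classes = [row[-1] for row in D]          # class labels are in the last column
    if len(set(classes)) == 1:                # all tuples have the same class: leaf
        return classes[0]
    if not attribute_list:                    # no attributes left: leaf labeled with majority class
        return Counter(classes).most_common(1)[0][0]
    # The attribute selection method picks the attribute that "best" separates the classes.
    best = max(attribute_list, key=lambda a: attribute_selection_method(D, a))
    remaining = [a for a in attribute_list if a != best]
    node = {best: {}}
    # Multiway split: grow one branch per value of the splitting attribute seen in D,
    # and recurse on each resulting (necessarily nonempty) partition Dj.
    for value in {row[best] for row in D}:
        Dj = [row for row in D if row[best] == value]
        node[best][value] = generate_decision_tree(Dj, remaining, attribute_selection_method)
    return node

# Example call: columns are (age, student, class label).
# D = [("youth", "no", "no"), ("youth", "yes", "yes"), ("senior", "yes", "yes")]
# generate_decision_tree(D, attribute_list=[0, 1])

Swapping in a Gini-index-based measure that proposes binary splits would steer the same skeleton toward a strictly binary tree, as in CART, whereas the multiway split shown here corresponds to the information-gain case.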
The tree starts as a single node, N, representing the training tuples in D (step 1).³
³ The partition of class-labeled training tuples at node N is the set of tuples that follow a path from
the root of the tree to node N when being processed by the tree. This set is sometimes referred to in
the literature as the family of tuples at node N. We have referred to this set as the “tuples represented
at node N,” “the tuples that reach node N,” or simply “the tuples at node N.” Rather than storing the
actual tuples at a node, most implementations store pointers to these tuples.
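
As a small, hypothetical illustration of the footnote's closing remark (this is not code from the text), a node might record only the positions of the tuples that reach it, rather than copies of the tuples themselves:

from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class TreeNode:
    tuple_indices: List[int]                   # "pointers" (row indices) into the training set
    splitting_attribute: Optional[int] = None  # set once the node is split
    children: Dict[object, "TreeNode"] = field(default_factory=dict)
    class_label: Optional[str] = None          # set only if the node becomes a leaf

# Step 1: the root node N represents the entire data partition D.
D = [("youth", "no", "no"), ("senior", "yes", "yes")]
N = TreeNode(tuple_indices=list(range(len(D))))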
 