Incremental versions of decision tree induction have also been proposed. When
given new training data, these restructure the decision tree acquired from learning on
previous training data, rather than relearning a new tree from scratch.
Differences in decision tree algorithms include how the attributes are selected in
creating the tree (Section 8.2.2) and the mechanisms used for pruning (Section 8.2.3).
The basic algorithm described earlier requires one pass over the training tuples in D for
each level of the tree. This can lead to long training times and lack of available memory
when dealing with large databases. Improvements regarding the scalability of decision
tree induction are discussed in Section 8.2.4. Section 8.2.5 presents a visual interactive
approach to decision tree construction. A discussion of strategies for extracting rules
from decision trees is given in Section 8.4.2 regarding rule-based classification.
8.2.2 Attribute Selection Measures
An attribute selection measure is a heuristic for selecting the splitting criterion that
"best" separates a given data partition, D, of class-labeled training tuples into individual
classes. If we were to split D into smaller partitions according to the outcomes of the
splitting criterion, ideally each partition would be pure (i.e., all the tuples that fall into a
given partition would belong to the same class). Conceptually, the "best" splitting criterion
is the one that most closely results in such a scenario. Attribute selection measures
are also known as splitting rules because they determine how the tuples at a given node
are to be split.
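To make the notion of purity concrete, here is a minimal sketch that splits a small training set on one attribute and checks whether each resulting partition is pure. The dict-based tuple representation and the "class" key are assumptions made for this illustration, not definitions from the text.

```python
def is_pure(partition):
    """Return True if all tuples in the partition share a single class label."""
    return len({t["class"] for t in partition}) <= 1

# Split a tiny training set D on the attribute "age" and check each outcome.
D = [
    {"age": "youth",  "class": "no"},
    {"age": "youth",  "class": "no"},
    {"age": "senior", "class": "yes"},
    {"age": "senior", "class": "no"},
]
partitions = {}
for t in D:
    partitions.setdefault(t["age"], []).append(t)

for outcome, part in partitions.items():
    print(outcome, "-> pure" if is_pure(part) else "-> impure")
```

Here the "youth" partition is pure (all "no") while the "senior" partition is not; a measure that rewards such purity will favor this kind of split.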
The attribute selection measure provides a ranking for each attribute describing the
given training tuples. The attribute having the best score for the measure (see footnote 4) is chosen as
the splitting attribute for the given tuples. If the splitting attribute is continuous-valued
or if we are restricted to binary trees, then, respectively, either a split point or a splitting
subset must also be determined as part of the splitting criterion. The tree node created
for partition D is labeled with the splitting criterion, branches are grown for each outcome
of the criterion, and the tuples are partitioned accordingly. This section describes
three popular attribute selection measures: information gain, gain ratio, and Gini index.
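The selection step itself is straightforward once a measure is fixed. The sketch below scores every candidate attribute on the current partition and returns the best-scoring one; determining a split point or splitting subset for the chosen attribute would follow as a separate step. The function name and the `measure` callback are assumptions for this example, not names from the text.

```python
def choose_splitting_attribute(D, attributes, measure, maximize=True):
    """Return the candidate attribute with the best score under the given measure."""
    scores = {a: measure(D, a) for a in attributes}
    # Some measures are maximized (e.g., information gain);
    # others, such as the Gini index of a split, are minimized.
    pick = max if maximize else min
    return pick(scores, key=scores.get)
```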
The notation used herein is as follows. Let D, the data partition, be a training set of
class-labeled tuples. Suppose the class label attribute has m distinct values defining m
distinct classes, C_i (for i = 1, ..., m). Let C_i,D be the set of tuples of class C_i in D. Let |D|
and |C_i,D| denote the number of tuples in D and C_i,D, respectively.
Information Gain
ID3 uses information gain as its attribute selection measure. This measure is based on
pioneering work by Claude Shannon on information theory, which studied the value or
“information content” of messages. Let node N represent or hold the tuples of partition
D. The attribute with the highest information gain is chosen as the splitting attribute for
node N. This attribute minimizes the information needed to classify the tuples in the
resulting partitions and reflects the least randomness or "impurity" in these partitions.
Footnote 4. Depending on the measure, either the highest or lowest score is chosen as the best (i.e., some measures
strive to maximize while others strive to minimize).
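Assuming the standard definitions Info(D) = -sum_i p_i log2(p_i) with p_i = |C_i,D| / |D|, and Gain(A) = Info(D) - sum_j (|D_j| / |D|) Info(D_j) for a discrete-valued attribute A that partitions D into subsets D_1, ..., D_v, a compact sketch of information gain might look like the following. The tuple representation and the "class" key are again assumptions for illustration.

```python
from collections import Counter
from math import log2

def info(D):
    """Expected information (entropy) needed to classify a tuple in D."""
    n = len(D)
    counts = Counter(t["class"] for t in D)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def gain(D, attribute):
    """Information gain from partitioning D on a discrete-valued attribute."""
    n = len(D)
    partitions = {}
    for t in D:
        partitions.setdefault(t[attribute], []).append(t)
    expected = sum(len(Dj) / n * info(Dj) for Dj in partitions.values())
    return info(D) - expected
```

With a helper like this, the splitting attribute for node N is simply the attribute with the largest gain over the tuples held at N.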
 