Incremental versions of decision tree induction have also been proposed. When
given new training data, these restructure the decision tree acquired from learning on
previous training data, rather than relearning a new tree from scratch.
Differences in decision tree algorithms include how the attributes are selected in
creating the tree (Section 8.2.2) and the mechanisms used for pruning (Section 8.2.3).
The basic algorithm described earlier requires one pass over the training tuples in D for
each level of the tree. This can lead to long training times and lack of available memory
when dealing with large databases. Improvements regarding the scalability of decision
tree induction are discussed in Section 8.2.4. Section 8.2.5 presents a visual interactive
approach to decision tree construction. A discussion of strategies for extracting rules
from decision trees is given in Section 8.4.2 regarding rule-based classification.
8.2.2 Attribute Selection Measures
An attribute selection measure is a heuristic for selecting the splitting criterion that
"best" separates a given data partition, D, of class-labeled training tuples into individual
classes. If we were to split D into smaller partitions according to the outcomes of the
splitting criterion, ideally each partition would be pure (i.e., all the tuples that fall into a
given partition would belong to the same class). Conceptually, the "best" splitting criterion
is the one that most closely results in such a scenario. Attribute selection measures
are also known as splitting rules because they determine how the tuples at a given node
are to be split.
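To make the notion of purity concrete, here is a minimal sketch that splits a small training set on one attribute and checks whether each resulting partition is pure. The dict-based tuple representation and the "class" key are assumptions made for this illustration, not definitions from the text.

```python
def is_pure(partition):
    """Return True if all tuples in the partition share a single class label."""
    return len({t["class"] for t in partition}) <= 1

# Split a tiny training set D on the attribute "age" and check each outcome.
D = [
    {"age": "youth",  "class": "no"},
    {"age": "youth",  "class": "no"},
    {"age": "senior", "class": "yes"},
    {"age": "senior", "class": "no"},
]
partitions = {}
for t in D:
    partitions.setdefault(t["age"], []).append(t)

for outcome, part in partitions.items():
    print(outcome, "-> pure" if is_pure(part) else "-> impure")
```

Here the "youth" partition is pure (all "no") while the "senior" partition is not; a measure that rewards such purity will favor this kind of split.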
The attribute selection measure provides a ranking for each attribute describing the
given training tuples. The attribute having the best score for the measure (see footnote 4) is chosen as
the splitting attribute for the given tuples. If the splitting attribute is continuous-valued
or if we are restricted to binary trees, then, respectively, either a split point or a splitting
subset must also be determined as part of the splitting criterion. The tree node created
for partition D is labeled with the splitting criterion, branches are grown for each outcome
of the criterion, and the tuples are partitioned accordingly. This section describes
three popular attribute selection measures: information gain, gain ratio, and Gini index.
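The selection step itself is straightforward once a measure is fixed. The sketch below scores every candidate attribute on the current partition and returns the best-scoring one; determining a split point or splitting subset for the chosen attribute would follow as a separate step. The function name and the `measure` callback are assumptions for this example, not names from the text.

```python
def choose_splitting_attribute(D, attributes, measure, maximize=True):
    """Return the candidate attribute with the best score under the given measure."""
    scores = {a: measure(D, a) for a in attributes}
    # Some measures are maximized (e.g., information gain);
    # others, such as the Gini index of a split, are minimized.
    pick = max if maximize else min
    return pick(scores, key=scores.get)
```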
The notation used herein is as follows. Let D, the data partition, be a training set of
class-labeled tuples. Suppose the class label attribute has m distinct values defining m
distinct classes, C_i (for i = 1, ..., m). Let C_i,D be the set of tuples of class C_i in D. Let |D|
and |C_i,D| denote the number of tuples in D and C_i,D, respectively.
Information Gain
ID3 uses information gain as its attribute selection measure. This measure is based on
pioneering work by Claude Shannon on information theory, which studied the value or
“information content” of messages. Let node N represent or hold the tuples of partition
D. The attribute with the highest information gain is chosen as the splitting attribute for
node N. This attribute minimizes the information needed to classify the tuples in the
resulting partitions and reflects the least randomness or "impurity" in these partitions.
Footnote 4. Depending on the measure, either the highest or lowest score is chosen as the best (i.e., some measures
strive to maximize while others strive to minimize).
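Assuming the standard definitions Info(D) = -sum_i p_i log2(p_i) with p_i = |C_i,D| / |D|, and Gain(A) = Info(D) - sum_j (|D_j| / |D|) Info(D_j) for a discrete-valued attribute A that partitions D into subsets D_1, ..., D_v, a compact sketch of information gain might look like the following. The tuple representation and the "class" key are again assumptions for illustration.

```python
from collections import Counter
from math import log2

def info(D):
    """Expected information (entropy) needed to classify a tuple in D."""
    n = len(D)
    counts = Counter(t["class"] for t in D)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def gain(D, attribute):
    """Information gain from partitioning D on a discrete-valued attribute."""
    n = len(D)
    partitions = {}
    for t in D:
        partitions.setdefault(t[attribute], []).append(t)
    expected = sum(len(Dj) / n * info(Dj) for Dj in partitions.values())
    return info(D) - expected
```

With a helper like this, the splitting attribute for node N is simply the attribute with the largest gain over the tuples held at N.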
 