Imagine that each classifier in the ensemble is a decision tree classifier, so that the collection of classifiers is a “forest.” The individual decision trees are generated using a random selection of
attributes at each node to determine the split. More formally, each tree depends on the
values of a random vector sampled independently and with the same distribution for
all trees in the forest. During classification, each tree votes and the most popular class is
returned.
Random forests can be built using bagging (Section 8.6.2) in tandem with random
attribute selection. A training set, D, of d tuples is given. The general procedure to generate k decision trees for the ensemble is as follows. For each iteration, i (i = 1, 2, ..., k), a training set, D_i, of d tuples is sampled with replacement from D. That is, each D_i is a bootstrap sample of D (Section 8.5.4), so that some tuples may occur more than once in D_i, while others may be excluded. Let F be the number of attributes to be used to determine the split at each node, where F is much smaller than the number of available attributes. To construct a decision tree classifier, M_i, randomly select, at each node,
F attributes as candidates for the split at the node. The CART methodology is used to
grow the trees. The trees are grown to maximum size and are not pruned. Random
forests formed this way, with random input selection, are called Forest-RI.
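The Forest-RI procedure maps directly onto standard library components. The following is a minimal Python sketch under the assumption that scikit-learn's CART-style DecisionTreeClassifier serves as the base learner; the function names forest_ri and majority_vote, and the default values for k and F, are illustrative choices rather than anything prescribed by the text.

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier  # CART-style base learner

def forest_ri(X, y, k=100, F=None, seed=0):
    """Grow k unpruned trees, each on a bootstrap sample D_i of the
    training data, examining F random candidate attributes per node."""
    rng = np.random.RandomState(seed)
    d, n_attrs = X.shape
    if F is None:
        F = int(np.log2(n_attrs)) + 1           # illustrative default for F
    trees = []
    for i in range(k):
        idx = rng.randint(0, d, size=d)         # D_i: d tuples drawn with replacement from D
        tree = DecisionTreeClassifier(max_features=F,    # F candidates per split
                                      random_state=rng)  # grown to maximum size, not pruned
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def majority_vote(trees, X):
    # During classification, each tree votes and the most popular class is returned.
    votes = np.array([t.predict(X) for t in trees])   # shape (k, n_samples)
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])
```

Here X and y are assumed to be NumPy arrays; calling majority_vote(forest_ri(X_train, y_train), X_test) classifies new tuples by the plurality vote of the k trees.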
Another form of random forest, called Forest-RC, uses random linear combinations
of the input attributes. Instead of randomly selecting a subset of the attributes, it cre-
ates new attributes (or features) that are a linear combination of the existing attributes.
That is, an attribute is generated by specifying L, the number of original attributes to be combined. At a given node, L attributes are randomly selected and added together with coefficients that are uniform random numbers on [-1, 1]. F linear combinations are
generated, and a search is made over these for the best split. This form of random forest
is useful when there are only a few attributes available, so as to reduce the correlation
between individual classifiers.
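A simplified sketch of the Forest-RC idea is given below, again assuming scikit-learn's decision tree as the base learner. For brevity it draws the F random combinations of L attributes once per tree, whereas the procedure described above draws them afresh at each node; the names forest_rc and random_combinations are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_combinations(n_attrs, F, L, rng):
    # Each row of W defines one new feature: a linear combination of L randomly
    # chosen original attributes with coefficients uniform on [-1, 1].
    W = np.zeros((F, n_attrs))
    for f in range(F):
        cols = rng.choice(n_attrs, size=L, replace=False)
        W[f, cols] = rng.uniform(-1.0, 1.0, size=L)
    return W

def forest_rc(X, y, k=100, F=10, L=3, seed=0):
    rng = np.random.RandomState(seed)
    d, n_attrs = X.shape
    models = []
    for i in range(k):
        idx = rng.randint(0, d, size=d)                 # bootstrap sample D_i
        W = random_combinations(n_attrs, F, L, rng)
        tree = DecisionTreeClassifier(random_state=rng)
        tree.fit(X[idx] @ W.T, y[idx])                  # splits are searched over the F combined features
        models.append((tree, W))                        # keep W to transform tuples at prediction time
    return models
```

At classification time each tuple is projected with a tree's own W (that is, tree.predict(X @ W.T)) and the trees vote as in Forest-RI.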
Random forests are comparable in accuracy to AdaBoost, yet are more robust to
errors and outliers. The generalization error for a forest converges as long as the num-
ber of trees in the forest is large. Thus, overfitting is not a problem. The accuracy of a
random forest depends on the strength of the individual classifiers and a measure of the
dependence between them. The ideal is to maintain the strength of individual classifiers
without increasing their correlation. Random forests are insensitive to the number of
attributes selected for consideration at each split. Typically, up to log_2 d + 1 are chosen.
(An interesting empirical observation was that using a single random input attribute
may result in good accuracy that is often higher than when using several attributes.)
Because random forests consider many fewer attributes for each split, they are efficient
on very large databases. They can be faster than either bagging or boosting. Random
forests give internal estimates of variable importance.
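As a small illustration of the variable-importance point, the sketch below uses scikit-learn's RandomForestClassifier and its impurity-based feature_importances_ attribute, one common way such internal estimates are exposed; Breiman's original proposal instead permutes each attribute in the out-of-bag samples and measures the resulting loss of accuracy. The synthetic data set and parameter values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data purely for illustration: 10 attributes, 3 of them informative.
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=3, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Mean decrease in impurity, averaged over the trees, for each attribute.
ranked = sorted(enumerate(rf.feature_importances_), key=lambda t: -t[1])
for attr, score in ranked[:3]:
    print(f"attribute {attr}: importance {score:.3f}")
```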
8.6.5 Improving Classification Accuracy of Class-Imbalanced Data
In this section, we revisit the class imbalance problem. In particular, we study approaches
to improving the classification accuracy of class-imbalanced data.
Given two-class data, the data are class-imbalanced if the main class of interest (the
positive class) is represented by only a few tuples, while the majority of tuples represent
the negative class. For multiclass-imbalanced data, the data distribution of each class
 