Imagine that each classifier in the ensemble is a decision tree classifier, so that the collection of classifiers is a “forest.” The individual decision trees are generated using a random selection of
attributes at each node to determine the split. More formally, each tree depends on the
values of a random vector sampled independently and with the same distribution for
all trees in the forest. During classification, each tree votes and the most popular class is
returned.
Random forests can be built using bagging (Section 8.6.2) in tandem with random
attribute selection. A training set, D, of d tuples is given. The general procedure to generate k decision trees for the ensemble is as follows. For each iteration, i (i = 1, 2, ..., k), a training set, D_i, of d tuples is sampled with replacement from D. That is, each D_i is a bootstrap sample of D (Section 8.5.4), so that some tuples may occur more than once in D_i, while others may be excluded. Let F be the number of attributes to be used to determine the split at each node, where F is much smaller than the number of available attributes. To construct a decision tree classifier, M_i, randomly select, at each node,
F attributes as candidates for the split at the node. The CART methodology is used to
grow the trees. The trees are grown to maximum size and are not pruned. Random
forests formed this way, with random input selection, are called Forest-RI.
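The Forest-RI procedure maps directly onto standard library components. The following is a minimal Python sketch under the assumption that scikit-learn's CART-style DecisionTreeClassifier serves as the base learner; the function names forest_ri and majority_vote, and the default values for k and F, are illustrative choices rather than anything prescribed by the text.

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier  # CART-style base learner

def forest_ri(X, y, k=100, F=None, seed=0):
    """Grow k unpruned trees, each on a bootstrap sample D_i of the
    training data, examining F random candidate attributes per node."""
    rng = np.random.RandomState(seed)
    d, n_attrs = X.shape
    if F is None:
        F = int(np.log2(n_attrs)) + 1           # illustrative default for F
    trees = []
    for i in range(k):
        idx = rng.randint(0, d, size=d)         # D_i: d tuples drawn with replacement from D
        tree = DecisionTreeClassifier(max_features=F,    # F candidates per split
                                      random_state=rng)  # grown to maximum size, not pruned
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def majority_vote(trees, X):
    # During classification, each tree votes and the most popular class is returned.
    votes = np.array([t.predict(X) for t in trees])   # shape (k, n_samples)
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])
```

Here X and y are assumed to be NumPy arrays; calling majority_vote(forest_ri(X_train, y_train), X_test) classifies new tuples by the plurality vote of the k trees.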
Another form of random forest, called Forest-RC, uses random linear combinations
of the input attributes. Instead of randomly selecting a subset of the attributes, it cre-
ates new attributes (or features) that are a linear combination of the existing attributes.
That is, an attribute is generated by specifying L, the number of original attributes to be combined. At a given node, L attributes are randomly selected and added together with coefficients that are uniform random numbers on [-1, 1]. F linear combinations are
generated, and a search is made over these for the best split. This form of random forest
is useful when there are only a few attributes available, so as to reduce the correlation
between individual classifiers.
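A simplified sketch of the Forest-RC idea is given below, again assuming scikit-learn's decision tree as the base learner. For brevity it draws the F random combinations of L attributes once per tree, whereas the procedure described above draws them afresh at each node; the names forest_rc and random_combinations are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_combinations(n_attrs, F, L, rng):
    # Each row of W defines one new feature: a linear combination of L randomly
    # chosen original attributes with coefficients uniform on [-1, 1].
    W = np.zeros((F, n_attrs))
    for f in range(F):
        cols = rng.choice(n_attrs, size=L, replace=False)
        W[f, cols] = rng.uniform(-1.0, 1.0, size=L)
    return W

def forest_rc(X, y, k=100, F=10, L=3, seed=0):
    rng = np.random.RandomState(seed)
    d, n_attrs = X.shape
    models = []
    for i in range(k):
        idx = rng.randint(0, d, size=d)                 # bootstrap sample D_i
        W = random_combinations(n_attrs, F, L, rng)
        tree = DecisionTreeClassifier(random_state=rng)
        tree.fit(X[idx] @ W.T, y[idx])                  # splits are searched over the F combined features
        models.append((tree, W))                        # keep W to transform tuples at prediction time
    return models
```

At classification time each tuple is projected with a tree's own W (that is, tree.predict(X @ W.T)) and the trees vote as in Forest-RI.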
Random forests are comparable in accuracy to AdaBoost, yet are more robust to
errors and outliers. The generalization error for a forest converges as long as the num-
ber of trees in the forest is large. Thus, overfitting is not a problem. The accuracy of a
random forest depends on the strength of the individual classifiers and a measure of the
dependence between them. The ideal is to maintain the strength of individual classifiers
without increasing their correlation. Random forests are insensitive to the number of
attributes selected for consideration at each split. Typically, up to log_2 d + 1 are chosen.
(An interesting empirical observation was that using a single random input attribute
may result in good accuracy that is often higher than when using several attributes.)
Because random forests consider many fewer attributes for each split, they are efficient
on very large databases. They can be faster than either bagging or boosting. Random
forests give internal estimates of variable importance.
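As a small illustration of the variable-importance point, the sketch below uses scikit-learn's RandomForestClassifier and its impurity-based feature_importances_ attribute, one common way such internal estimates are exposed; Breiman's original proposal instead permutes each attribute in the out-of-bag samples and measures the resulting loss of accuracy. The synthetic data set and parameter values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data purely for illustration: 10 attributes, 3 of them informative.
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=3, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Mean decrease in impurity, averaged over the trees, for each attribute.
ranked = sorted(enumerate(rf.feature_importances_), key=lambda t: -t[1])
for attr, score in ranked[:3]:
    print(f"attribute {attr}: importance {score:.3f}")
```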
8.6.5 Improving Classification Accuracy of Class-Imbalanced Data
In this section, we revisit the class imbalance problem. In particular, we study approaches
to improving the classification accuracy of class-imbalanced data.
Given two-class data, the data are class-imbalanced if the main class of interest (the
positive class) is represented by only a few tuples, while the majority of tuples represent
the negative class. For multiclass-imbalanced data, the data distribution of each class
 