Each of the N training examples has the same probability, 1/N, of being drawn in each trial. The Bagging algorithm shown in Fig. 6.7 does exactly this: N times, the algorithm chooses a number r at random from 1 to N and adds the r-th training example to the bootstrap training set S. Clearly, some of the original training examples will not be selected for inclusion in the bootstrapped training set, while others will be chosen more than once. On average, each generated bootstrapped training set will contain about 0.63N unique training examples, even though it contains N training examples in total. In bagging, we create M such bootstrap training sets and then generate a classifier from each of them. Bagging returns a function h(x) that classifies new examples by returning the class y that receives the maximum number of votes from the base models {h_1, h_2, ..., h_M}.
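As a concrete illustration of this procedure, the sketch below builds a bagged ensemble and takes a majority vote over its predictions. It assumes scikit-learn decision trees as the base models and NumPy arrays with integer class labels as inputs (neither is prescribed by the text), and the function names are purely illustrative.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def build_bagging_ensemble(X, y, M=25, random_state=0):
        # Train M base models, each on its own bootstrap sample of the N examples.
        rng = np.random.default_rng(random_state)
        N = len(X)
        models = []
        for _ in range(M):
            idx = rng.integers(0, N, size=N)   # N draws with replacement (the r's)
            models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return models

    def predict_majority(models, X):
        # h(x): return the class that receives the most votes from h_1, ..., h_M.
        votes = np.stack([h.predict(X) for h in models]).astype(int)  # shape (M, n_samples)
        return np.array([np.bincount(col).argmax() for col in votes.T])

For any single bootstrap sample, len(np.unique(idx)) will typically come out close to 0.63N, matching the expected fraction of unique training examples noted above.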
In bagging, the M bootstrap training sets that are created are likely to
have some differences. If these differences are enough to induce noticeable
differences among the M base models while leaving their performances
reasonably good, then the ensemble will probably perform better than the
base models individually.
Random Forest:
Random forest is an ensemble of unpruned classification or regression trees, induced from bootstrap samples of the training data, using random feature selection in the tree induction process.
Prediction is made by aggregating (majority vote for classification or
averaging for regression) the predictions of the ensemble. Random forest
generally exhibits a substantial performance improvement over single-tree classifiers such as CART and C4.5. It yields a generalization error rate that compares favorably to that of AdaBoost, yet it is more robust to noise. However, like most classifiers, random forest can suffer when learning from an extremely imbalanced training data set: because it is constructed to minimize the overall error rate, it tends to focus on the prediction accuracy of the majority class, which often results in poor accuracy for the minority class.
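The construction just described can be sketched with an off-the-shelf implementation. The snippet below uses scikit-learn's RandomForestClassifier as a stand-in (the text does not name a specific library): each tree is grown unpruned on a bootstrap sample, max_features="sqrt" supplies the random feature selection at each split, and the predictions are aggregated across trees. The dataset is synthetic and purely illustrative.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Ensemble of unpruned trees on bootstrap samples with random feature selection.
    forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                    bootstrap=True, random_state=0).fit(X_tr, y_tr)
    # Single CART-style tree as a baseline for comparison.
    single = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

    print("single tree accuracy:", single.score(X_te, y_te))
    print("random forest accuracy:", forest.score(X_te, y_te))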
In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test error. Since each tree is constructed from a bootstrap sample, approximately one-third of the cases are left out of that sample and not used in training the corresponding tree. These cases are called out-of-bag (oob) cases, and they are used to obtain a running, unbiased estimate of the classification error as trees are added to the forest.
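A brief sketch of this oob estimate, again assuming scikit-learn as the implementation: with oob_score=True, each training example is scored only by the trees whose bootstrap samples did not contain it, so no separate test set is required. The dataset is again synthetic and illustrative.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    forest = RandomForestClassifier(n_estimators=200, oob_score=True,
                                    bootstrap=True, random_state=0).fit(X, y)
    # oob_score_ is the accuracy on out-of-bag cases; 1 - oob_score_
    # approximates the classification error without a held-out test set.
    print("oob accuracy:", forest.oob_score_)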
The error rate of a forest depends on the correlation between any two trees and the strength of each tree in the forest. Increasing the correlation between trees increases the forest error rate, while increasing the strength of the individual trees decreases it.