Semi-Supervised Learning and Collective Classification plugin 2 . In the remainder
of this section we review the collective algorithms used in the empirical
evaluation.
2.1 CollectiveIBk
Internally, it uses WEKA's classic IBk algorithm, an implementation of the
K-Nearest Neighbours (KNN) classifier, to determine the best k on the training
set. It then builds, for every instance in the test set, a neighbourhood of k
instances drawn from the pool of training and test instances (either a naïve
search over the complete set of instances or a k-dimensional tree is used to
determine the neighbours).
All neighbours in such a neighbourhood are sorted according to their distance
to the test instance they belong to. The neighbourhoods themselves are then
sorted according to their 'rank', where the rank is determined by the differing
occurrences of the two classes in the neighbourhood.
The unlabelled test instance with the highest rank has its class label
determined by majority vote or, in the case of a tie, by the first class. This
is repeated until no unlabelled test instances remain. The classification
finally terminates by returning the class label of the instance that is about
to be classified.
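To make the procedure concrete, the following is a minimal sketch of this
neighbourhood-ranking loop, not WEKA's implementation: it assumes binary
{0, 1} class labels, uses scikit-learn's NearestNeighbors for the neighbour
search, and reads 'rank' as the absolute difference between the two class
counts in a neighbourhood.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def collective_ibk(X_train, y_train, X_test, k=3):
        # neighbours are drawn from the pooled training and test instances
        X_pool = np.vstack([X_train, X_test])
        offset = len(X_train)
        labels = {i: int(y) for i, y in enumerate(y_train)}
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X_pool)
        unlabelled = set(range(offset, len(X_pool)))
        while unlabelled:
            best, best_rank, best_counts = None, -1, None
            for i in unlabelled:
                # neighbourhood of k instances, already sorted by distance
                _, idx = nn.kneighbors(X_pool[i:i + 1])
                neigh = [j for j in idx[0] if j != i][:k]
                votes = [labels[j] for j in neigh if j in labels]
                counts = np.bincount(votes, minlength=2) if votes else np.zeros(2)
                rank = abs(int(counts[0]) - int(counts[1]))  # assumed rank measure
                if rank > best_rank:
                    best, best_rank, best_counts = i, rank, counts
            # majority vote; np.argmax returns the first class on a tie
            labels[best] = int(np.argmax(best_counts))
            unlabelled.discard(best)
        return np.array([labels[offset + i] for i in range(len(X_test))])

Labelling the highest-ranked instance first lets the most clear-cut
neighbourhoods influence the later, more ambiguous ones.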
2.2 CollectiveForest
It uses WEKA's implementation of RandomTree as the base classifier and divides
the test set into folds containing the same number of elements. The first
iteration trains on the original training set and generates the distribution
for all instances in the test set. The best instances are then added to the
original training set (the number of instances chosen being the same as in a
fold). Each subsequent iteration trains on the new training set and generates
the distributions for the remaining instances in the test set.
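The loop below is a minimal sketch of this scheme under stated assumptions:
scikit-learn's RandomForestClassifier stands in for WEKA's RandomTree
ensemble, and the 'best' instances are taken to be those with the highest
predicted class probability.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def collective_forest(X_train, y_train, X_test, n_folds=10):
        X_tr, y_tr = X_train.copy(), np.asarray(y_train).copy()
        remaining = np.arange(len(X_test))          # test indices still unlabelled
        fold_size = max(1, len(X_test) // n_folds)  # instances added per iteration
        pred = np.empty(len(X_test), dtype=int)
        while len(remaining):
            model = RandomForestClassifier(n_estimators=100).fit(X_tr, y_tr)
            proba = model.predict_proba(X_test[remaining])
            conf = proba.max(axis=1)              # confidence in the predicted class
            take = np.argsort(-conf)[:fold_size]  # one fold of 'best' instances
            chosen = remaining[take]
            pred[chosen] = model.classes_[proba[take].argmax(axis=1)]
            # grow the training set with the confidently labelled test instances
            X_tr = np.vstack([X_tr, X_test[chosen]])
            y_tr = np.concatenate([y_tr, pred[chosen]])
            remaining = np.delete(remaining, take)
        return pred

Each iteration therefore sees a training set enlarged by one fold of
self-labelled test data, which is what distinguishes the method from a single
supervised pass.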
2.3 CollectiveWoods and CollectiveTree
CollectiveWoods works like CollectiveForest, using CollectiveTree instead of
RandomTree.
CollectiveTree is similar to WEKA's original RandomTree classifier: it splits
on the attribute at the position that divides the current subset of instances
(training and test instances) into two halves. The process finishes when one
of the following conditions is met (a sketch of these stopping rules follows
the list):
- Only training instances would be covered (the labels for these instances are
already known).
- Only test instances are left in the leaf, in which case the distribution
from the parent node is taken.
- Only training instances of one class are left, in which case all test
instances are considered to have this class.
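The recursion below sketches only the three stopping rules above; the split
itself (the median of a randomly chosen attribute, roughly halving the mixed
subset) and the binary {0, 1} labels are illustrative assumptions, not WEKA's
code. Test rows carry the placeholder label -1.

    import numpy as np

    def collective_tree(X, y, test_mask, parent_dist=None, rng=np.random):
        train = ~test_mask
        if not test_mask.any():  # only training instances: labels already known
            return {}
        if not train.any():      # only test instances: take the parent's distribution
            cls = int(np.argmax(parent_dist))
            return {int(i): cls for i in np.where(test_mask)[0]}
        classes = np.unique(y[train])
        if len(classes) == 1:    # one training class: all test instances get it
            return {int(i): int(classes[0]) for i in np.where(test_mask)[0]}
        a = rng.randint(X.shape[1])       # assumed split: median of a random attribute
        left = X[:, a] <= np.median(X[:, a])
        if left.all() or not left.any():  # degenerate split: fall back to majority
            cls = int(np.bincount(y[train]).argmax())
            return {int(i): cls for i in np.where(test_mask)[0]}
        dist = np.bincount(y[train], minlength=2)  # distribution handed to children
        out = {}
        for side in (left, ~left):
            idx = np.where(side)[0]
            sub = collective_tree(X[side], y[side], test_mask[side], dist, rng)
            out.update({int(idx[i]): c for i, c in sub.items()})
        return out

Calling collective_tree(X, y, test_mask) on the pooled data returns a mapping
from test row index to the class assigned by whichever stopping rule fired.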
2 http://www.scms.waikato.ac.nz/~fracpete/projects/collectiveclassification
 