Semi-Supervised Learning and Collective Classification plugin 2 . In the remainder
of this section we review the collective algorithms used in the empirical
evaluation.
2.1 CollectiveIBk
Internally, it uses WEKA's classic IBk algorithm, an implementation of the
K-Nearest Neighbours (KNN) classifier, to determine the best k on the training
set. It then builds, for every instance in the test set, a neighbourhood of k
instances drawn from the pool of training and test instances (either a naïve
search over the complete set of instances or a k-dimensional tree is used to
determine the neighbours).
All neighbours in such a neighbourhood are sorted according to their distance
to the test instance they belong to. The neighbourhoods themselves are then
sorted according to their 'rank', where the rank is determined by the differing
occurrences of the two classes in the neighbourhood.
The unlabelled test instance with the highest rank has its class label
determined by majority vote or, in the case of a tie, by the first class. This
is repeated until no unlabelled test instances remain. The classification
finally terminates by returning the class label of the instance that is about
to be classified.
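To make the procedure concrete, the following is a minimal sketch of this
neighbourhood-ranking loop, not WEKA's implementation: it assumes binary
{0, 1} class labels, uses scikit-learn's NearestNeighbors for the neighbour
search, and reads 'rank' as the absolute difference between the two class
counts in a neighbourhood.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def collective_ibk(X_train, y_train, X_test, k=3):
        # neighbours are drawn from the pooled training and test instances
        X_pool = np.vstack([X_train, X_test])
        offset = len(X_train)
        labels = {i: int(y) for i, y in enumerate(y_train)}
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X_pool)
        unlabelled = set(range(offset, len(X_pool)))
        while unlabelled:
            best, best_rank, best_counts = None, -1, None
            for i in unlabelled:
                # neighbourhood of k instances, already sorted by distance
                _, idx = nn.kneighbors(X_pool[i:i + 1])
                neigh = [j for j in idx[0] if j != i][:k]
                votes = [labels[j] for j in neigh if j in labels]
                counts = np.bincount(votes, minlength=2) if votes else np.zeros(2)
                rank = abs(int(counts[0]) - int(counts[1]))  # assumed rank measure
                if rank > best_rank:
                    best, best_rank, best_counts = i, rank, counts
            # majority vote; np.argmax returns the first class on a tie
            labels[best] = int(np.argmax(best_counts))
            unlabelled.discard(best)
        return np.array([labels[offset + i] for i in range(len(X_test))])

Labelling the highest-ranked instance first lets the most clear-cut
neighbourhoods influence the later, more ambiguous ones.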
2.2 CollectiveForest
It uses WEKA's implementation of RandomTree as the base classifier and divides
the test set into folds containing the same number of elements. The first
iteration trains on the original training set and generates the distribution
for all instances in the test set. The best instances are then added to the
original training set (the number of instances chosen being the same as in a
fold). Each subsequent iteration trains on the new training set and generates
the distributions for the remaining instances in the test set.
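The loop below is a minimal sketch of this scheme under stated assumptions:
scikit-learn's RandomForestClassifier stands in for WEKA's RandomTree
ensemble, and the 'best' instances are taken to be those with the highest
predicted class probability.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def collective_forest(X_train, y_train, X_test, n_folds=10):
        X_tr, y_tr = X_train.copy(), np.asarray(y_train).copy()
        remaining = np.arange(len(X_test))          # test indices still unlabelled
        fold_size = max(1, len(X_test) // n_folds)  # instances added per iteration
        pred = np.empty(len(X_test), dtype=int)
        while len(remaining):
            model = RandomForestClassifier(n_estimators=100).fit(X_tr, y_tr)
            proba = model.predict_proba(X_test[remaining])
            conf = proba.max(axis=1)              # confidence in the predicted class
            take = np.argsort(-conf)[:fold_size]  # one fold of 'best' instances
            chosen = remaining[take]
            pred[chosen] = model.classes_[proba[take].argmax(axis=1)]
            # grow the training set with the confidently labelled test instances
            X_tr = np.vstack([X_tr, X_test[chosen]])
            y_tr = np.concatenate([y_tr, pred[chosen]])
            remaining = np.delete(remaining, take)
        return pred

Each iteration therefore sees a training set enlarged by one fold of
self-labelled test data, which is what distinguishes the method from a single
supervised pass.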
2.3 CollectiveWoods and CollectiveTree
CollectiveWoods works like CollectiveForest, using CollectiveTree instead of
RandomTree.
CollectiveTree is similar to WEKA's original RandomTree classifier: it splits
on the attribute at the position that divides the current subset of instances
(training and test instances) into two halves. The process finishes when one
of the following conditions is met (a sketch of these stopping rules follows
the list):
- Only training instances would be covered (the labels for these instances are
already known).
- Only test instances are left in the leaf, in which case the distribution
from the parent node is taken.
- Only training instances of one class are left, in which case all test
instances are considered to have this class.
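The recursion below sketches only the three stopping rules above; the split
itself (the median of a randomly chosen attribute, roughly halving the mixed
subset) and the binary {0, 1} labels are illustrative assumptions, not WEKA's
code. Test rows carry the placeholder label -1.

    import numpy as np

    def collective_tree(X, y, test_mask, parent_dist=None, rng=np.random):
        train = ~test_mask
        if not test_mask.any():  # only training instances: labels already known
            return {}
        if not train.any():      # only test instances: take the parent's distribution
            cls = int(np.argmax(parent_dist))
            return {int(i): cls for i in np.where(test_mask)[0]}
        classes = np.unique(y[train])
        if len(classes) == 1:    # one training class: all test instances get it
            return {int(i): int(classes[0]) for i in np.where(test_mask)[0]}
        a = rng.randint(X.shape[1])       # assumed split: median of a random attribute
        left = X[:, a] <= np.median(X[:, a])
        if left.all() or not left.any():  # degenerate split: fall back to majority
            cls = int(np.bincount(y[train]).argmax())
            return {int(i): cls for i in np.where(test_mask)[0]}
        dist = np.bincount(y[train], minlength=2)  # distribution handed to children
        out = {}
        for side in (left, ~left):
            idx = np.where(side)[0]
            sub = collective_tree(X[side], y[side], test_mask[side], dist, rng)
            out.update({int(idx[i]): c for i, c in sub.items()})
        return out

Calling collective_tree(X, y, test_mask) on the pooled data returns a mapping
from test row index to the class assigned by whichever stopping rule fired.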
2 http://www.scms.waikato.ac.nz/~fracpete/projects/collectiveclassification
 