The WebKB dataset consists of documents divided into the four standard university splits (after discarding the "other" split), containing webpages from Cornell, Texas, Wisconsin, and Washington. We also performed stemming and stop word removal to obtain a vocabulary of 1703 distinct words. The dataset contains 1608 hyperlinks and 5 class labels. Note that webpages from one university do not link to webpages from the other universities; consequently, when performing four-fold cross-validation using the university splits, we can only use the words in the webpages to seed the inference process. There are no observed labels to bootstrap the inference.
This is not the case with the Cora and CiteSeer datasets.
The Cora dataset contains Machine Learning papers, each assigned to one of 7 classes, while the CiteSeer dataset has 6 class labels. For both datasets, we performed stemming and stop word removal, in addition to removing words with a document frequency of less than 10. The final corpus has 2708 documents, 1433 distinct words in the vocabulary, and 5429 links in the case of Cora, and 3312 documents, 3703 distinct words in the vocabulary, and 4732 links in the case of CiteSeer.
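As a rough illustration of this preprocessing, the sketch below builds a stemmed, stop-word-filtered, binary bag-of-words representation with a document-frequency cutoff of 10. The use of scikit-learn's CountVectorizer and NLTK's Porter stemmer, and the `documents` variable, are assumptions made for illustration; they are not necessarily the tools used to prepare the original corpora.

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()
# Default word tokenization plus English stop word removal, followed by stemming.
_base_analyzer = CountVectorizer(stop_words="english").build_analyzer()

def stem_analyzer(doc):
    return [stemmer.stem(token) for token in _base_analyzer(doc)]

# binary=True yields 0/1 word-presence features; min_df=10 drops any word that
# appears in fewer than 10 documents, matching the document-frequency cutoff above.
vectorizer = CountVectorizer(analyzer=stem_analyzer, min_df=10, binary=True)

# `documents` is a hypothetical list of raw paper texts:
# X = vectorizer.fit_transform(documents)
# vocabulary_size = len(vectorizer.vocabulary_)
```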
Unlike WebKB, the Cora and CiteSeer datasets do not have natural splits in the data for use as test and training sets. To create splits, we use two sampling strategies: random sampling and snowball sampling. Random sampling (RS) follows the traditional k-fold cross-validation methodology, where we choose nodes at random to create the splits. In snowball sampling (SS), we sample with a bias toward placing neighboring nodes in the same split. We construct the splits by randomly selecting an initial node and expanding around it. We do not expand randomly; instead, we select nodes based on the class distribution of the dataset, so that the test data is stratified. The selected nodes form the test set, while the remaining nodes form the training set. We repeat the sampling k times to obtain k test-train pairs of splits. Note that when using SS, unlike RS, some objects may appear in more than one test split. Consequently, we adjust the accuracy computation so that objects appearing multiple times are not overcounted: we use a simple strategy where we first average the accuracy for each instance and then take the average of these per-instance averages. Also, to help the reader compare results between the SS and RS strategies, we report accuracies averaged per instance over only those instances that appear in test sets under both SS and RS (i.e., instances in at least one SS test split). We denote these numbers using the term matched cross-validation (M).
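To make the snowball sampling procedure concrete, the following is a minimal sketch of one way such stratified test splits could be constructed. The NetworkX graph representation with a "label" node attribute, the function name snowball_split, and the specific under-representation heuristic are illustrative assumptions, not the original implementation.

```python
import random
from collections import Counter

import networkx as nx

def snowball_split(graph, test_fraction=0.25, seed=None):
    """Grow a stratified test split around a randomly chosen seed node."""
    rng = random.Random(seed)
    labels = nx.get_node_attributes(graph, "label")
    overall = Counter(labels.values())          # overall class distribution
    total = graph.number_of_nodes()
    target_size = int(test_fraction * total)

    test, sampled = set(), Counter()

    def deficit(node):
        # How over-represented the node's class already is in the test split;
        # the most negative value marks the most under-represented class.
        lab = labels[node]
        return sampled[lab] / max(len(test), 1) - overall[lab] / total

    frontier = [rng.choice(list(graph.nodes))]
    while frontier and len(test) < target_size:
        node = min(frontier, key=deficit)       # stratified expansion
        frontier.remove(node)
        if node in test:
            continue
        test.add(node)
        sampled[labels[node]] += 1
        neighbors = [n for n in graph.neighbors(node) if n not in test]
        rng.shuffle(neighbors)
        frontier.extend(neighbors)

    train = set(graph.nodes) - test             # remaining nodes form the training set
    return train, test
```

Calling snowball_split k times with different seeds would yield the k test-train pairs described above.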
For each dataset, we performed both random sampling evaluation (with 10
splits) and snowball sampling evaluation (averaged over 10 runs).
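A minimal sketch of the per-instance accuracy averaging used for SS, and of the matched cross-validation (M) restriction, is given below; the prediction-record format (instance id, predicted label, true label) is a hypothetical choice for illustration.

```python
from collections import defaultdict

def ss_accuracy(predictions):
    """Average accuracy within each instance first, then average those averages."""
    per_instance = defaultdict(list)
    for instance_id, predicted, true in predictions:
        per_instance[instance_id].append(1.0 if predicted == true else 0.0)
    # Instances that fall into several SS test splits contribute one averaged
    # value each, so they are not overcounted.
    instance_means = [sum(v) / len(v) for v in per_instance.values()]
    return sum(instance_means) / len(instance_means)

def matched_accuracy(predictions_ss, predictions_rs):
    """Matched CV (M): score RS only on instances that appear in some SS test split."""
    matched_ids = {instance_id for instance_id, _, _ in predictions_ss}
    rs_matched = [p for p in predictions_rs if p[0] in matched_ids]
    return ss_accuracy(predictions_ss), ss_accuracy(rs_matched)
```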
3.6.2.1 Results
The accuracy results for the real-world datasets are shown in Table 3.1, Table 3.2, and Table 3.3. The accuracies are separated by sampling method and base classifier. The highest accuracy at each partition is shown in bold. We