The WebKB dataset consists of documents divided into the four standard university splits (after discarding the "other" split), containing webpages from Cornell, Texas, Wisconsin, and Washington. We also performed stemming and stop word removal to obtain a vocabulary of 1703 distinct words. The dataset contains 1608 hyperlinks and 5 class labels. Note that webpages from one university do not link to webpages from the other universities; consequently, when performing four-fold cross-validation using the university splits, we can only use the words in the webpages to seed the inference process. There are no observed labels to bootstrap the inference.
This is not the case with the Cora and CiteSeer datasets.
The Cora dataset contains Machine Learning papers, each assigned to one of 7 classes, while the CiteSeer dataset has 6 class labels. For both datasets, we performed stemming and stop word removal, in addition to removing words with a document frequency of less than 10. The final corpus has 2708 documents, 1433 distinct words in the vocabulary, and 5429 links in the case of Cora, and 3312 documents, 3703 distinct words in the vocabulary, and 4732 links in the case of CiteSeer.
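As a rough illustration of this preprocessing, the sketch below builds a stemmed, stop-word-filtered, binary bag-of-words representation with a document-frequency cutoff of 10. The use of scikit-learn's CountVectorizer and NLTK's Porter stemmer, and the `documents` variable, are assumptions made for illustration; they are not necessarily the tools used to prepare the original corpora.

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()
# Default word tokenization plus English stop word removal, followed by stemming.
_base_analyzer = CountVectorizer(stop_words="english").build_analyzer()

def stem_analyzer(doc):
    return [stemmer.stem(token) for token in _base_analyzer(doc)]

# binary=True yields 0/1 word-presence features; min_df=10 drops any word that
# appears in fewer than 10 documents, matching the document-frequency cutoff above.
vectorizer = CountVectorizer(analyzer=stem_analyzer, min_df=10, binary=True)

# `documents` is a hypothetical list of raw paper texts:
# X = vectorizer.fit_transform(documents)
# vocabulary_size = len(vectorizer.vocabulary_)
```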
Unlike WebKB, the Cora and CiteSeer datasets do not have natural splits in the data for use as test and training sets. To create splits, we use two sampling strategies: random sampling and snowball sampling. Random sampling (RS) follows the traditional k-fold cross-validation methodology, where we choose nodes at random to create the splits. In snowball sampling (SS), we sample with a bias toward placing neighboring nodes in the same split. We construct the splits by randomly selecting an initial node and expanding around it. We do not expand randomly; instead, we select nodes based on the class distribution of the dataset, so that the test data is stratified. The selected nodes form the test set, while the remaining nodes form the training set. We repeat the sampling k times to obtain k test-train pairs of splits. Note that when using SS, unlike RS, some objects may appear in more than one test split. Consequently, we adjust the accuracy computation so that objects appearing multiple times are not overcounted: we use a simple strategy where we first average the accuracy for each instance and then take the average of these per-instance averages. Also, to help the reader compare results between the SS and RS strategies, we report accuracies averaged per instance over only those instances that appear in test sets under both SS and RS (i.e., instances in at least one SS test split). We denote these numbers using the term matched cross-validation (M).
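To make the snowball sampling procedure concrete, the following is a minimal sketch of one way such stratified test splits could be constructed. The NetworkX graph representation with a "label" node attribute, the function name snowball_split, and the specific under-representation heuristic are illustrative assumptions, not the original implementation.

```python
import random
from collections import Counter

import networkx as nx

def snowball_split(graph, test_fraction=0.25, seed=None):
    """Grow a stratified test split around a randomly chosen seed node."""
    rng = random.Random(seed)
    labels = nx.get_node_attributes(graph, "label")
    overall = Counter(labels.values())          # overall class distribution
    total = graph.number_of_nodes()
    target_size = int(test_fraction * total)

    test, sampled = set(), Counter()

    def deficit(node):
        # How over-represented the node's class already is in the test split;
        # the most negative value marks the most under-represented class.
        lab = labels[node]
        return sampled[lab] / max(len(test), 1) - overall[lab] / total

    frontier = [rng.choice(list(graph.nodes))]
    while frontier and len(test) < target_size:
        node = min(frontier, key=deficit)       # stratified expansion
        frontier.remove(node)
        if node in test:
            continue
        test.add(node)
        sampled[labels[node]] += 1
        neighbors = [n for n in graph.neighbors(node) if n not in test]
        rng.shuffle(neighbors)
        frontier.extend(neighbors)

    train = set(graph.nodes) - test             # remaining nodes form the training set
    return train, test
```

Calling snowball_split k times with different seeds would yield the k test-train pairs described above.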
For each dataset, we performed both random sampling evaluation (with 10
splits) and snowball sampling evaluation (averaged over 10 runs).
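A minimal sketch of the per-instance accuracy averaging used for SS, and of the matched cross-validation (M) restriction, is given below; the prediction-record format (instance id, predicted label, true label) is a hypothetical choice for illustration.

```python
from collections import defaultdict

def ss_accuracy(predictions):
    """Average accuracy within each instance first, then average those averages."""
    per_instance = defaultdict(list)
    for instance_id, predicted, true in predictions:
        per_instance[instance_id].append(1.0 if predicted == true else 0.0)
    # Instances that fall into several SS test splits contribute one averaged
    # value each, so they are not overcounted.
    instance_means = [sum(v) / len(v) for v in per_instance.values()]
    return sum(instance_means) / len(instance_means)

def matched_accuracy(predictions_ss, predictions_rs):
    """Matched CV (M): score RS only on instances that appear in some SS test split."""
    matched_ids = {instance_id for instance_id, _, _ in predictions_ss}
    rs_matched = [p for p in predictions_rs if p[0] in matched_ids]
    return ss_accuracy(predictions_ss), ss_accuracy(rs_matched)
```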
3.6.2.1 Results
The accuracy results for the real-world datasets are shown in Table 3.1, Table 3.2, and Table 3.3. The accuracies are separated by sampling method and base classifier. The highest accuracy at each partition is shown in bold. We