3.5 Learning the Classifiers
One aspect of the collective classification problem that we have not discussed so far is how to learn the various classifiers described in the previous sections. Learning refers to the problem of determining the parameter values for the local classifier, in the case of ICA and GS, and the values in the clique potentials, in the case of LBP and MF, which can subsequently be used to classify unseen test data. For all our experiments, we learned the parameter values from fully labeled datasets using gradient-based optimization approaches. Unfortunately, a full treatment of this subject is not possible within this article, and we refer the interested reader to other works that discuss it in more depth, such as (34), (31), (32).
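As a rough illustration of gradient-based parameter learning (not the exact procedure, features, or hyperparameters used in our experiments), the following Python sketch trains a binary logistic-regression local classifier on a fully labeled toy graph; the feature construction, learning rate, and data are assumptions made purely for illustration.

    import numpy as np

    def node_features(word_ids, neighbor_labels, vocab_size, num_labels):
        # Concatenate binary word indicators with neighbor-label counts.
        x = np.zeros(vocab_size + num_labels)
        x[list(word_ids)] = 1.0                    # binary word presence
        for lab in neighbor_labels:                # count aggregation over neighbors
            x[vocab_size + lab] += 1.0
        return x

    def train_logistic(X, y, lr=0.1, epochs=200, l2=1e-3):
        # Minimize the L2-regularized negative log-likelihood by gradient descent.
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(epochs):
            p = 1.0 / (1.0 + np.exp(-X @ w))       # predicted P(y = 1 | x)
            grad = X.T @ (p - y) / n + l2 * w      # gradient of the loss
            w -= lr * grad
        return w

    # Toy fully labeled 3-node graph (hypothetical data).
    vocab_size, num_labels = 5, 2
    graph = {0: [1], 1: [0, 2], 2: [1]}            # adjacency list
    words = {0: {0, 2}, 1: {1}, 2: {3, 4}}         # word ids per document
    labels = {0: 1, 1: 0, 2: 1}

    X = np.array([node_features(words[v], [labels[u] for u in graph[v]],
                                vocab_size, num_labels) for v in graph])
    y = np.array([labels[v] for v in graph], dtype=float)
    w = train_logistic(X, y)

The learned weight vector w plays the role of the local-classifier parameters; for LBP and MF, the analogous quantities would be the entries of the clique potentials.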
3.6 Experimental Comparison
In our evaluation, we compared the four collective classification (CC) algorithms discussed in the previous sections and a content-only classifier (CO), which does not take the link structure into account, along with two choices of local classifier, on document classification tasks. The two local classifiers we tried were naïve Bayes (NB) and Logistic Regression (LR). This gave us 8 different classifiers: CO with NB, CO with LR, ICA with NB, ICA with LR, GS with NB, GS with LR, MF, and LBP. The datasets we used for the experiments included both real-world and synthetic datasets.
3.6.1 Features Used
For CO classifiers, we used the words in the documents as observed attributes. In particular, we used a binary value to indicate whether or not a word appears in the document. For ICA and GS, we used the same local attributes (i.e., words), together with count aggregation, which counts the number of neighbors with each label value in a node's neighborhood. Finally, for LBP and MF, we used pairwise Markov Random Fields with clique potentials defined on the edges and unobserved nodes in the network.
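To make the last point concrete, a minimal sketch of the pairwise Markov random field representation is given below; the node names and potential values are placeholders chosen for illustration, not quantities learned in our experiments.

    import numpy as np

    # Node potentials: one nonnegative score per label for each unobserved node,
    # typically derived from the node's observed word attributes (2 labels here).
    node_potential = {
        "doc_a": np.array([0.8, 0.2]),
        "doc_b": np.array([0.4, 0.6]),
    }

    # Clique potential on each edge: a labels-by-labels table scoring how
    # compatible the labels of two linked documents are (here favoring agreement).
    edge_potential = np.array([[2.0, 0.5],
                               [0.5, 2.0]])

    edges = [("doc_a", "doc_b")]

    def unnormalized_score(assignment):
        # The joint distribution is proportional to the product of all node
        # potentials and all edge potentials under a given label assignment.
        score = 1.0
        for v, lab in assignment.items():
            score *= node_potential[v][lab]
        for u, v in edges:
            score *= edge_potential[assignment[u], assignment[v]]
        return score

    print(unnormalized_score({"doc_a": 0, "doc_b": 0}))   # 0.8 * 0.4 * 2.0 = 0.64

LBP and MF then perform approximate inference over exactly this kind of factorized model to recover the marginal label distributions of the unobserved nodes.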
3.6.2 Real-World Datasets
We experimented with three real-world datasets: Cora and CiteSeer (two bibliographic datasets) and WebKB (a hypertext dataset). For the WebKB experiments, we only considered documents that link to, or are linked to by, at least one other webpage in the corpus. This gave us a corpus of 877 documents.
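As a small illustration of this filtering step (the page identifiers and link list below are hypothetical), one can keep exactly the pages that have at least one in-link or out-link inside the corpus:

    def filter_linked_documents(pages, links):
        # Keep only pages with at least one in- or out-link inside the corpus.
        connected = set()
        for src, dst in links:
            if src in pages and dst in pages:
                connected.add(src)
                connected.add(dst)
        return pages & connected

    # Toy usage with hypothetical page ids; "p5" lies outside the corpus, and
    # "p4" has no links, so only p1, p2, and p3 survive the filter.
    pages = {"p1", "p2", "p3", "p4"}
    links = [("p1", "p2"), ("p2", "p3"), ("p5", "p1")]
    print(filter_linked_documents(pages, links))   # {'p1', 'p2', 'p3'}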