factor described by Kwon and Lee (2000) as the similarity measure. The weight of each document
term is the product of the typographical weight described before and the classical inverse
document frequency (IDF).
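The product of the two factors can be illustrated with a minimal sketch. It assumes the common IDF form log(N / df) and takes the typographical weight of each term as a ready-made input; the function names and the toy collection are hypothetical and not taken from the original implementation.

```python
import math

def idf(term, documents):
    """Classical inverse document frequency, assumed here as log(N / df)."""
    df = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / df) if df else 0.0

def term_weights(typo_weights, documents):
    """Multiply each term's typographical weight by its IDF.

    typo_weights: dict mapping term -> typographical weight in one document
                  (hypothetical input; its computation is described earlier).
    documents:    list of term sets, one set per document of the collection.
    """
    return {t: w * idf(t, documents) for t, w in typo_weights.items()}

# Toy collection of four documents (illustrative data only).
docs = [{"neural", "network"}, {"network", "protocol"},
        {"protocol", "stack"}, {"network"}]
weights = term_weights({"neural": 2.0, "network": 1.0}, docs)
```

In this toy example the rare, typographically emphasized term "neural" receives a much higher weight than the frequent term "network".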
In order to make our results more comparable to other publications, we have not considered any
information about document hierarchy. Furthermore, we have slightly modified the statistical feature
selection methods “mutual information” (MI) and “chi square” (Yang & Pedersen, 1997), both of which
are based on the binary decision as to whether a document contains a term or not. We have replaced
the binary probabilities in these methods with the numeric term probabilities, because the binary
variants do not evaluate the real relevance of a term in a document and our typographical weights
would otherwise not affect the feature evaluation process.
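One plausible reading of this modification, shown for chi square, is sketched below. The exact formula used in the evaluation is not given in the text, so this is an assumption: the four cells of the usual contingency table are filled with sums of numeric term probabilities in [0, 1] instead of counts of binary occurrences. The function name and inputs are hypothetical.

```python
def soft_chi_square(term_probs, labels, category):
    """Chi square statistic with numeric term probabilities (assumed variant).

    term_probs[i]: probability or normalized weight of the term in document i,
                   a value in [0, 1]; a binary 0/1 value gives the classic case.
    labels[i]:     category label of document i.
    """
    a = b = c = d = 0.0
    for p, label in zip(term_probs, labels):
        if label == category:
            a += p          # term present, document in category
            c += 1.0 - p    # term absent, document in category
        else:
            b += p          # term present, document outside category
            d += 1.0 - p    # term absent, document outside category
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

labels = ["x", "x", "y", "y"]
binary_score = soft_chi_square([1, 1, 0, 0], labels, "x")
soft_score = soft_chi_square([0.9, 0.8, 0.1, 0.2], labels, "x")
```

With binary inputs the function reduces to the classic statistic, while graded inputs let a strongly weighted occurrence count for more than a weak one.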
Due to the lack of freely available test collections with typographically enriched text documents,
we have evaluated our approach using the freely available HTML test collections “4 Universities Data
Set” and “7sectors Data Set” from the “CMU World Wide Knowledge Base” (Web-KB) project and
our own selection of publications from the “ACM Digital Library” in PDF format. In order to get the
typographic values of the documents of the HTML test collections, the HTML code was rendered by
the JEditorPane of the J2SE 5.0 (endnote 5) before the beginning of the typographic evaluation. This
guarantees that HTML documents are evaluated in the same way as PDF or other document formats.
Following the setup in Nigam et al. (1998), we only used the classes “course,” “faculty,” “project,”
and “student” for the evaluation of the “4 Universities Data Set,” and we used the pages from Cornell
University for testing, while all other pages were used as training base.
We have evaluated the “7sectors Data Set” by randomly selecting 80% of the documents for the
classification tests and using the remaining 20% as the training base.
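A random partition of this kind can be sketched as follows. The function name, the fixed seed, and the use of integer document ids are illustrative assumptions; the proportions follow the description above.

```python
import random

def split_collection(documents, test_fraction, seed=0):
    """Randomly partition documents into (test set, training base)."""
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[:n_test], shuffled[n_test:]

# 100 hypothetical document ids, split as described for the "7sectors Data Set".
test_docs, train_docs = split_collection(range(100), test_fraction=0.8)
```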
For the ACM test collection, we used retrieval queries to the “ACM Digital Library” with labels
from the “CCS classification tree” (endnote 6) in order to index at most 20 PDF documents for each
class. Of these documents, 25% were used for the classification tests.
In our evaluation, we want to show the improvement that is possible through the use of the
typographic information contained in text documents. We have therefore removed documents without
typographic information from all test collections, since they would otherwise have distorted the
results. In all tests, we only considered categories containing at least six documents after the
removal of the documents for classification tests and the deletion of the documents without
typographic information. Details of the resulting test collections are given in Table 7.
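The two filtering steps can be sketched as below, assuming that the test documents have already been set aside and that each remaining document is a record with a category and a flag for typographic information. The field names and the function name are hypothetical.

```python
from collections import defaultdict

def filter_training_base(training_docs, min_docs=6):
    """Drop documents without typographic information, then drop every
    category left with fewer than min_docs documents."""
    by_category = defaultdict(list)
    for doc in training_docs:
        if doc["has_typography"]:
            by_category[doc["category"]].append(doc)
    return {c: docs for c, docs in by_category.items() if len(docs) >= min_docs}

# Illustrative data: category "a" keeps 6 usable documents, "b" only 5.
collection = (
    [{"category": "a", "has_typography": True}] * 6
    + [{"category": "a", "has_typography": False}]
    + [{"category": "b", "has_typography": True}] * 5
    + [{"category": "b", "has_typography": False}]
)
kept = filter_training_base(collection)
```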
The global MI evaluation proved to be the best method for the selection of the features of the “4
Universities Data Set”; however, the local chi square evaluation was better for the other two test
collections. Unlike other approaches, we did not select a constant number of features per category.
Instead, we selected the number of features depending on the sum of the best feature weights of the
current category. The higher the best weights of a category, the fewer terms of this category are
used for classification.
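The exact selection rule is not spelled out here, so the following is only one way to obtain the described behavior. It assumes that terms are taken in descending order of their feature weight until the accumulated weight reaches a fixed budget; the budget parameter and the function name are hypothetical.

```python
def select_features(scores, budget):
    """Pick the best-scoring terms of one category until their summed
    score reaches the budget (assumed stopping rule)."""
    selected, total = [], 0.0
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    for term, score in ranked:
        if total >= budget:
            break
        selected.append(term)
        total += score
    return selected

# A category with a few strong terms versus one with many weak terms.
strong = select_features({"a": 5.0, "b": 4.0, "c": 1.0}, budget=8.0)
weak = select_features({"v": 2.0, "w": 2.0, "x": 2.0, "y": 2.0, "z": 2.0}, budget=8.0)
```

Under this rule a category whose best terms carry high weights exhausts the budget quickly and contributes fewer terms than a category with uniformly weak terms.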
We used the micro-averaged precision/recall break-even point (Lewis, 1992), a common performance
measure for text classification, for evaluating the classification tests. This measure is shown as a
function of the number of selected features in Figures 3 to 5, each of which contains three
curves. The thin curve shows the classification quality of a conventional, purely frequency-based
kNN classifier. The dashed curve shows the quality of a frequency-based kNN classifier with relative
feature selection and, for the chi square feature selection, the relative per-category selection of
features described before. The bold curve shows the quality of this kNN classifier with additional
consideration of our typographical weights.
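For readers unfamiliar with the measure, the sketch below shows one common way to compute a micro-averaged break-even point; it is not necessarily the procedure used for the figures. It assumes that the classifier yields a score for every (document, category) pair and that all pairs are pooled into one ranking. If the number of accepted pairs equals the number of relevant pairs, precision and recall share the same denominator and therefore coincide. The function name and the data are hypothetical, and tied scores are not treated specially.

```python
def micro_breakeven(decisions):
    """Micro-averaged precision/recall break-even point.

    decisions: list of (score, is_relevant) tuples pooled over all
               (document, category) pairs.
    """
    total_relevant = sum(1 for _, relevant in decisions if relevant)
    if total_relevant == 0:
        return 0.0
    ranked = sorted(decisions, key=lambda d: d[0], reverse=True)
    # Accept exactly as many pairs as there are relevant ones, so that
    # precision = tp / accepted equals recall = tp / relevant.
    accepted = ranked[:total_relevant]
    true_positives = sum(1 for _, relevant in accepted if relevant)
    return true_positives / total_relevant

pooled = [(0.9, True), (0.8, False), (0.7, True), (0.4, True), (0.2, False)]
bep = micro_breakeven(pooled)
```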