Supporting Text Retrieval by Typographical Term Weighting - Distributed Artificial Intelligence, Agent Technology, and Collaborative Applications

factor described by Kwon and Lee (2000) as the similarity measure. The weighting of the document terms results from the product of our typographical weighting approach described before and the classical inverse document frequency (IDF).
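The product of typographical weight and IDF can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each document is already available as a mapping from term to typographical weight, and it assumes the common IDF form log(N / df), which is not spelled out in this passage. The function names are hypothetical.

```python
import math
from collections import defaultdict

def inverse_document_frequency(docs):
    """Return log(N / df) per term (assumed IDF form).

    docs: list of dicts mapping term -> typographical weight.
    """
    n_docs = len(docs)
    doc_freq = defaultdict(int)
    for doc in docs:
        for term in doc:
            doc_freq[term] += 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def weight_documents(docs):
    """Multiply each typographical weight by the IDF of its term."""
    idf = inverse_document_frequency(docs)
    return [{term: w * idf[term] for term, w in doc.items()} for doc in docs]

# Hypothetical typographical weights for two tiny documents.
docs = [
    {"retrieval": 2.5, "text": 1.0},
    {"text": 1.0, "agent": 3.0},
]
weighted = weight_documents(docs)
```

In this toy input, "text" occurs in every document, so its IDF and therefore its final weight is zero regardless of its typographical weight.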

In order to make our results more comparable to other publications, we did not consider any information about document hierarchy. Furthermore, we slightly modified the statistical feature selection methods “mutual information” (MI) and “chi square” (Yang & Pedersen, 1997), both of which are based on the binary decision as to whether a document contains a term or not. We replaced the binary probabilities in these methods with numeric term probabilities, because the binary versions do not evaluate the real relevance of a term in a document, and our typographical weights would therefore not affect the feature evaluation process.
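To illustrate the idea of replacing binary term occurrence with numeric values, the sketch below computes a "soft" chi square statistic. It is only one possible reading: the passage does not define how the numeric term probabilities are derived, so the code assumes that each document supplies a value in [0, 1] for the term, which then takes the place of the 0/1 occurrence in the usual contingency table of Yang and Pedersen (1997). The function name and input layout are hypothetical.

```python
def soft_chi_square(term_probs, labels, category):
    """Chi square of one term for one category with soft occurrence counts.

    term_probs[i]: assumed value in [0, 1] describing how strongly
                   document i "contains" the term (1.0 or 0.0 reproduces
                   the classical binary statistic).
    labels[i]:     category label of document i.
    """
    a = b = c = d = 0.0
    for p, label in zip(term_probs, labels):
        if label == category:
            a += p          # in category, term present
            c += 1.0 - p    # in category, term absent
        else:
            b += p          # outside category, term present
            d += 1.0 - p    # outside category, term absent
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0.0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denominator
```

With purely binary inputs the function reduces to the classical statistic, so the effect of the numeric values can be checked by comparing both variants on the same collection.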

Due to the lack of freely available test collections with typographically enriched text documents, we evaluated our approach using the freely available HTML test collections “4 Universities Data Set” and “7sectors Data Set” from the “CMU World Wide Knowledge Base” (Web-KB) project and our own selection of publications from the “ACM Digital Library” in PDF format. To obtain the typographic values of the documents in the HTML test collections, the HTML code was rendered by the JEditorPane of J2SE 5.0 (footnote 5) before the typographic evaluation began. This ensures that HTML documents are evaluated in the same way as PDF or other document formats.

Following the setup in Nigam et al. (1998), we only used the classes “course,” “faculty,” “project,” and “student” for the evaluation of the “4 Universities Data Set,” and we used the pages from Cornell University for testing, while all other pages were used as training base.

We have evaluated the “7sectors Data Set” by randomly selecting 80% of the documents for the classification tests and using the remaining 20% as the training base.

For the ACM test collection, we used retrieval queries to the “ACM Digital Library” with labels from the “CCS classification tree” (footnote 6) in order to index at most 20 PDF documents for each class. Of these documents, 25% were used for the classification tests.

Our evaluation aims to show the improvement that can be achieved by using the typographic information contained in text documents. We therefore removed documents without typographic information from all test collections, since they would otherwise have led to misleading results. In all tests, we only considered categories containing at least six documents after the removal of the documents for classification tests and the deletion of the documents without typographic information. Details of the resulting test collections are given in Table 7.
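The category filter described above can be sketched as follows. The record layout (the keys "category", "has_typography", and "split") is a hypothetical assumption made for illustration; only the threshold of six documents comes from the text.

```python
from collections import Counter

def eligible_categories(docs, min_docs=6):
    """Return categories that keep at least min_docs documents.

    Only documents that carry typographic information and were not set
    aside for the classification tests are counted.
    """
    counts = Counter(
        doc["category"]
        for doc in docs
        if doc["has_typography"] and doc["split"] == "train"
    )
    return {category for category, n in counts.items() if n >= min_docs}

def make_docs(category, n, has_typography=True, split="train"):
    """Build n hypothetical document records for the example below."""
    return [
        {"category": category, "has_typography": has_typography, "split": split}
        for _ in range(n)
    ]

docs = (
    make_docs("a", 6)                             # kept: six usable documents
    + make_docs("b", 5) + make_docs("b", 3, split="test")   # only five remain
    + make_docs("c", 5) + make_docs("c", 1, has_typography=False)
)
kept = eligible_categories(docs)
```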

The global MI evaluation proved to be the best method for selecting the features of the “4 Universities Data Set”; however, the local chi square evaluation was better for the other two test collections. Unlike other approaches, we did not select a constant number of features per category. Instead, we selected the number of features depending on the sum of the best feature weights of the current category. The higher the best weights of a category, the fewer terms of this category are used for classification.
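One way to obtain a category-dependent number of features is to add terms in descending order of their feature score until a cumulative score target is reached. The sketch below follows that reading; the exact selection rule is not given in this passage, and both the function name and the `budget` parameter are assumptions introduced for illustration.

```python
def select_features(scores, budget):
    """Pick the best-scoring terms of one category until their summed
    score reaches the (hypothetical) budget.

    scores: dict mapping term -> feature score (e.g. MI or chi square).
    """
    selected = []
    total = 0.0
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for term, score in ranked:
        if total >= budget:
            break
        selected.append(term)
        total += score
    return selected

# A category with a few strong features needs fewer terms ...
strong = {"a": 5.0, "b": 4.0, "c": 1.0}
# ... than a category whose best features are all weak.
weak = {"v": 2.0, "w": 2.0, "x": 2.0, "y": 2.0, "z": 2.0}

strong_terms = select_features(strong, budget=8.0)
weak_terms = select_features(weak, budget=8.0)
```

With the same budget, the category with high best weights ends up with fewer terms, which matches the behavior described in the text.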

We used the micro-averaged precision/recall break-even point (Lewis, 1992), a common performance measure for text classification, to evaluate the classification tests. Figures 3 to 5 show this measure as a function of the number of selected features; each figure contains three curves. The thin curve shows the classification quality of a conventional, purely frequency-based kNN classifier. The dashed curve shows the quality of a frequency-based kNN classifier with relative feature selection, that is, the relative selection of the features per category described before, applied to the chi square feature selection. The bold curve shows the quality of this kNN classifier with additional consideration of our typographical weights.
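The break-even point is the value at which precision and recall are equal. A minimal sketch of a micro-averaged variant is given below. It assumes that the classifier yields a score for every (document, category) pair, that all pairs are pooled and ranked with one global cutoff, and that the point with the smallest gap between precision and recall is reported when no exact tie exists. These are illustrative assumptions, not necessarily the procedure used for Figures 3 to 5.

```python
def micro_break_even(scored_decisions):
    """Approximate the micro-averaged precision/recall break-even point.

    scored_decisions: list of (score, is_correct) pairs pooled over all
    (document, category) pairs; is_correct is True when the document
    really belongs to the category. Tied scores are ranked arbitrarily.
    """
    ranked = sorted(scored_decisions, key=lambda pair: pair[0], reverse=True)
    total_positive = sum(1 for _, correct in ranked if correct)
    if total_positive == 0:
        return 0.0
    best_gap = None
    best_value = 0.0
    true_positive = 0
    # Move the cutoff down the ranking and track precision and recall.
    for k, (_, correct) in enumerate(ranked, start=1):
        if correct:
            true_positive += 1
        precision = true_positive / k
        recall = true_positive / total_positive
        gap = abs(precision - recall)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            best_value = (precision + recall) / 2
    return best_value
```

Pooling all decisions before computing precision and recall is what makes the measure micro-averaged: large categories contribute more decisions and therefore have more influence than small ones.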
