Supporting Text Retrieval by Typographical Term Weighting - Distributed Artificial Intelligence, Agent Technology, and Collaborative Applications


Authors use typographical information in their texts to make them more readable. We therefore follow the arguments of Apté et al. (1994), Cutler et al. (1997), Kim and Zhang (2000), and Kwon and Lee (2000) that typographical information may help to classify texts or to better understand their meaning. This leads to the following hypothesis, which can be regarded as an extension of Luhn's thesis:

The justification of measuring word significance by typography is based on the fact that a writer normally uses certain typographic styles to clarify his argumentation and the description of certain facts.

To verify our hypothesis, we have implemented our ideas within the VKC 1 document management system. To evaluate the classification quality of our approach, we used two public data sets of the World Wide Knowledge Base (Web-Kb) project 2, which contain HTML documents with typographical information, and our own selection of publications in PDF format from the ACM Digital Library 3. The evaluation shows that classification algorithms that consider typographical information allow the considered term set to be reduced, thereby significantly improving the efficiency of automated document classification.

The remainder of the article is organized as follows. The second section describes related work. The third section outlines our previous HTML tag-based typographical weighting approach, and the fourth section describes our catalogue evaluation scenario and summarizes the performance results of the tag-based approach. In the fifth section we describe our new general typography-based weighting approach, which we evaluate in the sixth section. The seventh section gives a summary and the conclusions.

Related Works

Apté, Damerau and Weiss (1994) presented the first typographic term weighting approach for text classification. They measured the classification quality on the “Reuters-21578 text categorization test collection” 4 and demonstrated that counting the terms of the news titles twice yields an improvement of nearly 2% in the precision/recall break-even point.
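The title-doubling idea can be illustrated with a minimal Python sketch. It is not the implementation of Apté, Damerau and Weiss: the function names `tokenize` and `term_frequencies`, the simple regular-expression tokenizer, and the `title_factor` parameter are assumptions made for illustration only.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_frequencies(title, body, title_factor=2):
    """Count body terms once and title terms `title_factor` times.

    `title_factor=2` corresponds to counting every title term twice;
    the parameter itself is a hypothetical generalization.
    """
    counts = Counter(tokenize(body))
    for term, n in Counter(tokenize(title)).items():
        counts[term] += title_factor * n
    return counts


tf = term_frequencies(
    title="Oil prices rise",
    body="Crude oil prices rose sharply as oil demand grew",
)
print(tf.most_common(3))
```

A term that occurs only in the title still enters the term set, with a count of two per occurrence, so title vocabulary is both added and emphasized.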

Cutler, Shih and Meng (1997) were the first to suggest an absolute weighting scheme for HTML tags. By weighting words enclosed in tags depending on the tag weight (cf. Table 1) the average preci-

Table 1. Absolute term weighting table by Cutler, Shih and Meng

HTML Tag                          Tag Weight
<a>                               1
(tag names missing in source)     8
(tag names missing in source)     1
<strong>, <b>, <em>, <i>,         1
<title>                           0
Remaining tags and normal text    1
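An absolute tag weighting scheme of this kind can be sketched in Python with the standard-library `html.parser` module. This is a sketch under stated assumptions, not the method of Cutler, Shih and Meng: the class `TagWeightedCounter`, the function `weighted_term_counts`, and the rule that the innermost enclosing tag with a table entry determines the weight are all hypothetical. The example weights are illustrative; only the weight 0 for `<title>` mirrors Table 1, while the value for `<h1>` is arbitrary.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TagWeightedCounter(HTMLParser):
    """Accumulate a weight per term, depending on the enclosing HTML tag."""

    def __init__(self, tag_weights, default_weight=1.0):
        super().__init__()
        self.tag_weights = tag_weights        # e.g. {"title": 0.0}
        self.default_weight = default_weight  # remaining tags and normal text
        self.open_tags = []                   # stack of currently open tags
        self.weights = Counter()

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the most recent matching tag and anything left open inside it.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i:]
                break

    def handle_data(self, data):
        # Assumption: the innermost open tag listed in the table wins.
        weight = self.default_weight
        for tag in reversed(self.open_tags):
            if tag in self.tag_weights:
                weight = self.tag_weights[tag]
                break
        for term in re.findall(r"[a-z0-9]+", data.lower()):
            self.weights[term] += weight


def weighted_term_counts(html, tag_weights, default_weight=1.0):
    """Return a dict mapping each term to its accumulated tag weight."""
    parser = TagWeightedCounter(tag_weights, default_weight)
    parser.feed(html)
    parser.close()
    return dict(parser.weights)


# Illustrative weights: "title" follows Table 1, "h1" is an arbitrary value.
example_weights = {"title": 0.0, "h1": 3.0}
page = (
    "<html><head><title>Agents</title></head>"
    "<body><h1>Agent systems</h1>"
    "<p>Agent <b>systems</b> cooperate.</p></body></html>"
)
print(weighted_term_counts(page, example_weights))
```

With a weight of 0, title terms are recorded but contribute nothing to the score, while every tag without a table entry falls back to the default weight of 1.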
