Authors use typographical information in their texts to make them more readable. We therefore
follow the arguments of Apté et al. (1994), Cutler et al. (1997), Kim and Zhang (2000), and Kwon and
Lee (2000) that typographical information may help to classify texts or to better understand their
meaning. This leads to the following hypothesis, which can be regarded as an extension of Luhn's thesis:
The justification of measuring word significance by typography is based on the fact that a writer normally
uses certain typographic styles to clarify his argumentation and the description of certain facts.
To verify our hypothesis, we have implemented our ideas within the VKC¹ document
management system. To evaluate the classification quality of our approach, we used two
public data sets from the World Wide Knowledge Base (Web-Kb) project², which contain HTML
documents with typographical information, and our own selection of publications in PDF format from
the ACM Digital Library³. The evaluation shows that classification algorithms that consider
typographical information allow the considered term set to be reduced, thereby significantly improving
the efficiency of automated document classification.
The remainder of the article is organized as follows. The second section reviews related
work. The third section outlines our previous HTML tag-based typographical weighting approach, and
the fourth section describes our catalogue evaluation scenario and summarizes the performance results
of the tag-based approach. In the fifth section we describe our new general typography-based
weighting approach, which we evaluate in the sixth section. The seventh section gives a summary
and the conclusions.
Related Works
Apté, Damerau and Weiss (1994) presented the first typographic term weighting approach for text
classification. They measured classification quality on the “Reuters-21578 text categorization test
collection”⁴ and demonstrated that counting the terms of the news titles twice improved the
precision/recall break-even point by nearly 2%.
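The title-weighting idea described above can be illustrated with a minimal sketch. It assumes a simple lowercase alphanumeric tokenizer and plain term counts; the function name `term_frequencies`, the parameter `title_factor`, and the tokenizer are hypothetical and are not taken from Apté, Damerau and Weiss (1994).

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Split text into lowercase alphanumeric terms (an assumed, simplistic tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_frequencies(title: str, body: str, title_factor: int = 2) -> Counter:
    """Count terms of a document, counting each title term `title_factor` times.

    With title_factor=2 every occurrence of a term in the title contributes 2
    to its frequency, while every occurrence in the body contributes 1.
    """
    counts = Counter(tokenize(body))
    for term in tokenize(title):
        counts[term] += title_factor
    return counts


if __name__ == "__main__":
    tf = term_frequencies("Wheat prices rise", "Prices of wheat rose sharply")
    for term, freq in sorted(tf.items()):
        print(term, freq)
```

The resulting counts could then feed any frequency-based term weighting, so the title boost needs no change to the classifier itself.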
Cutler, Shih and Meng (1997) were the first to suggest an absolute weighting scheme for HTML
tags. By weighting words enclosed in tags depending on the tag weight (cf. Table 1) the average preci-
Table 1. Absolute term weighting table by Cutler, Shih and Meng

HTML Tag                                           Tag Weight
<a>                                                1
<h1>, <h2>                                         8
<h3>, <h4>, <h5>, <h6>                             1
<strong>, <b>, <em>, <i>, <u>, <dl>, <ol>, <ul>    1
<title>                                            0
Remaining tags and normal text                     1
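An absolute tag weighting of this kind can be sketched as follows. This is a minimal illustration, not the implementation of Cutler, Shih and Meng: the weights are copied from Table 1 as printed here, and the class name `TagWeightedCounter`, the tokenizer, and the rule that the innermost enclosing tag listed in the table determines the weight are assumptions made for this example.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Weights copied from Table 1 as printed in this text.
TAG_WEIGHTS = {
    "a": 1,
    "h1": 8, "h2": 8,
    "h3": 1, "h4": 1, "h5": 1, "h6": 1,
    "strong": 1, "b": 1, "em": 1, "i": 1,
    "u": 1, "dl": 1, "ol": 1, "ul": 1,
    "title": 0,
}
DEFAULT_WEIGHT = 1  # remaining tags and normal text


class TagWeightedCounter(HTMLParser):
    """Accumulate a weighted term score for every word in an HTML document."""

    def __init__(self, tag_weights=TAG_WEIGHTS, default_weight=DEFAULT_WEIGHT):
        super().__init__()
        self.tag_weights = tag_weights
        self.default_weight = default_weight
        self.open_tags = []      # stack of currently open tags
        self.scores = Counter()  # term -> accumulated weight

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; this tolerates unclosed inner tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Assumption: the innermost open tag listed in the table sets the weight.
        weight = self.default_weight
        for tag in reversed(self.open_tags):
            if tag in self.tag_weights:
                weight = self.tag_weights[tag]
                break
        for term in re.findall(r"[a-z0-9]+", data.lower()):
            self.scores[term] += weight


if __name__ == "__main__":
    parser = TagWeightedCounter()
    parser.feed(
        "<html><head><title>Course page</title></head>"
        "<body><h1>Database course</h1>"
        "<p>This course covers <b>database</b> systems.</p></body></html>"
    )
    parser.close()
    for term, score in parser.scores.most_common():
        print(term, score)
```

Passing a different `tag_weights` dictionary makes it possible to try other weightings without changing the parsing logic.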