Chapter XI
Supporting Text Retrieval by
Typographical Term Weighting
Lars Werner
University of Paderborn, Germany
Stefan Böttcher
University of Paderborn, Germany
Text documents stored in information systems usually contain more information than the mere
concatenation of words: they also carry typographic information. Because conventional text retrieval
methods evaluate only word frequency, they miss the information that typography provides, e.g.,
about the importance of certain terms. To overcome this weakness, we present an approach that
uses the typographic information of text documents, and we show how it improves the efficiency
of text retrieval methods. Our approach weights typographic information in addition to term
frequencies in order to separate the relevant information in text documents from the noise. We have
evaluated our approach using automated text classification algorithms. The results show that our
weighting approach achieves competitive classification results while using at most 30% of the terms
used by conventional approaches, which makes it significantly more efficient.
Text documents combine textual and typographic information. However, ever since Luhn (1958),
information retrieval (IR) algorithms have relied on term frequency alone to measure the significance
of text; most common IR methods ignore the typographic information that the texts also contain.
Typographic information includes the use of different fonts, font sizes, and font styles, as well as
the choice of line length, text alignment, and the type area within the paper format.
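The idea of combining term frequency with typographic weighting can be illustrated with a minimal Python sketch. It is not the weighting scheme of this chapter, which is not specified in the text above. The style names, the values in STYLE_WEIGHTS, and the functions weighted_term_scores and top_terms are hypothetical and chosen only for illustration. Each occurrence of a term contributes the weight of its typographic style instead of a flat count of 1, and only the highest-scoring fraction of terms is kept.

```python
from collections import defaultdict

# Hypothetical weights per typographic style (illustrative values only,
# not taken from the chapter).
STYLE_WEIGHTS = {"plain": 1.0, "italic": 1.5, "bold": 2.0, "heading": 3.0}


def weighted_term_scores(tokens, style_weights=STYLE_WEIGHTS):
    """Sum a style-dependent weight for every occurrence of a term.

    tokens is an iterable of (term, style) pairs. With all weights set
    to 1.0 this reduces to a plain term-frequency count.
    """
    scores = defaultdict(float)
    for term, style in tokens:
        # Unknown styles fall back to the weight of plain text.
        scores[term.lower()] += style_weights.get(style, 1.0)
    return dict(scores)


def top_terms(scores, fraction=0.3):
    """Keep the highest-scoring fraction of terms (at least one term)."""
    keep = max(1, int(len(scores) * fraction))
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:keep]]


tokens = [
    ("Typography", "heading"),
    ("retrieval", "bold"),
    ("the", "plain"),
    ("the", "plain"),
    ("retrieval", "plain"),
    ("font", "italic"),
]
scores = weighted_term_scores(tokens)
print(top_terms(scores, fraction=0.5))  # ['retrieval', 'typography']
```

In this example "the" occurs twice but scores lower than "typography", which occurs once in a heading. A pure frequency count would rank them the other way round.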