Relative Weighting of HTML Tags
The approach described previously is the first attempt to weight typographic information from text
documents in the way the author intended. However, the weighting in Table 4 makes some general
assumptions about the design of HTML documents, and these can easily lead to incorrect estimates
of the relevance weights of HTML tags. For example, it assumes that body text is always set in
normal script. This may not be true: some authors set the body text in italics for design reasons
and use normal script for emphasis. The same principle applies to the use of other typefaces.
To solve this problem, it was necessary to develop a weighting approach based on a few generally
valid typographic rules. These rules weight the text passages in a document relative to the
document's design. Such a relative weighting approach, described in the following for common text
documents, is also possible for HTML documents, but we did not evaluate it.
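The idea of weighting relative to the document design can be illustrated with a minimal sketch.
It is not the authors' method. It assumes that a document has already been split into runs of
text with a single style each, that the style covering the most characters is the body style,
and that every other style marks emphasis. The names Run, body_style, relative_weights and the
emphasis weight of 2.0 are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import NamedTuple


class Run(NamedTuple):
    """A stretch of text set in one style (hypothetical type for this sketch)."""
    text: str
    style: str  # e.g. "normal", "italic", "bold"


def body_style(runs):
    """Assumption: the body style is the one that covers the most characters."""
    chars = Counter()
    for run in runs:
        chars[run.style] += len(run.text)
    return chars.most_common(1)[0][0]


def relative_weights(runs, emphasis_weight=2.0):
    """Weight each run relative to the body style instead of a fixed style table.

    Runs in the body style get weight 1.0; runs in any other style get
    emphasis_weight (an arbitrary illustrative value, not taken from the source).
    """
    if not runs:
        return []
    base = body_style(runs)
    return [
        (run.text, 1.0 if run.style == base else emphasis_weight)
        for run in runs
    ]


# A document whose body text is italic and whose emphasis is in normal script.
runs = [
    Run("Body text set in italics, ", "italic"),
    Run("stressed", "normal"),
    Run(" and more body text.", "italic"),
]
for text, weight in relative_weights(runs):
    print(f"{weight:.1f}  {text!r}")
```

In this example the italic runs are treated as body text and the short passage in normal script
is treated as emphasized, which is the case a fixed weighting table would misjudge.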
Typographic Term Weighting
All of the approaches described above obtain typographic information from HTML tags and are
therefore directly applicable only to HTML documents. In theory, documents in other formats could
be converted to HTML and then weighted as described. However, such a conversion is time-consuming
and a source of errors. In the following sections, we therefore present a general approach to
typographic term weighting that is applicable to ordinary text documents.
Typography in Text Documents
Typographic techniques have been used since the invention of letters to emphasize certain text
fragments and to design texts, and they are used as a matter of course in the design of text
documents today. The templates of modern word processors and the publication guidelines issued by
publishers lead to similar typography across text documents. Table 5 compares excerpts from the
author guidelines of the “Lecture Notes in Informatics (LNI)” of the German Society for Computer
Science (GI) with excerpts from the author guidelines of the “Association for Computing
Machinery” (ACM).
Approaches to Typographic Term Weighting
Common to all author guidelines is that text passages are more important if they are
typographically emphasized. Table 5 shows that font size is significant for the weighting of text
passages: the larger the font, the more relevant the passage is to the text. The abstract is an
exception to this rule because it is set in a smaller font size than the continuous text. However,
we do not consider this a source of error, because the abstract usually uses terms that are
repeated in the continuous text. The remaining typographic styles are not used uniformly and
depend heavily on the design. Therefore, different approaches to automated term weighting based on
typographic information are possible: