Weighting of HTML Tags
Relevant Information within HTML Tags
HTML tags within a document control the way in which the document will be displayed by a browser.
There are in principle three classes of tags: logical, physical, and meta-tags. HTML allows words to
be enclosed either within physical tags (e.g., the tags <i> and </i> require that the enclosed words be
displayed in italics) or within logical tags (e.g., <em> and </em> specify that the content is to be emphasized).
Meta-tags describe the HTML document and its contents.
Due to this diversity, it is necessary to develop an approach which weights logical and
physical tags in a manner consistent with their semantic relevance. To define a
suitable evaluation basis, we have examined different criteria for evaluating emphasized text, with the
aim of finding the closest possible match between objectively measured values and the
subjective impression of the emphasis of text passages. Among these criteria, the subjective
impression of different font styles depends on the medium used (monitor, paper, etc.) and turned out to
be too difficult to weight. Similar difficulties arise in weighting the subjective impression of different
colors, which may depend on the medium used, the background color etc. Even worse, weighting colors
depending on contrast returns wrong results for signal colors, e.g., a red word among black words on
white background.
As a result, we found that the number of pixels at a typical screen resolution is a much more
important relevance criterion than the previously mentioned criteria. The number of pixels is even more
important than text height or text width, because this criterion reflects not only font size but also
boldface. This has motivated us to develop an approach which evaluates the relevance of a text
passage based on the number of foreground pixels set.
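The pixel criterion can be illustrated with a minimal sketch. It assumes glyphs are available as small binary rasters; the bitmaps `REGULAR` and `BOLD` and the helpers `foreground_pixels` and `scale` are hypothetical names introduced here for illustration, not data or code from the original study.

```python
# Hypothetical 5x5 rasters of one stroke; '#' marks a foreground pixel.
# These toy bitmaps are illustrative assumptions, not measured data.
REGULAR = [
    ".#...",
    ".#...",
    ".#...",
    ".#...",
    ".#...",
]
BOLD = [
    ".##..",
    ".##..",
    ".##..",
    ".##..",
    ".##..",
]


def foreground_pixels(bitmap):
    """Count the pixels that are set to the foreground marker '#'."""
    return sum(row.count("#") for row in bitmap)


def scale(bitmap, factor):
    """Enlarge a bitmap by an integer factor, imitating a larger font size."""
    return [
        "".join(ch * factor for ch in row)
        for row in bitmap
        for _ in range(factor)
    ]


if __name__ == "__main__":
    print("regular:", foreground_pixels(REGULAR))
    print("bold:", foreground_pixels(BOLD))
    print("double size:", foreground_pixels(scale(REGULAR, 2)))
```

In this toy setting, both a heavier stroke and a larger size raise the foreground pixel count, which is why a single pixel count can stand in for font size and boldface together.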
In addition to the logical and physical HTML tags, which have direct influence on text representation
in the Web browser, there are some tags which have only a descriptive function. For example,
meta-tags can contain summarizing or supplementary information about the HTML document. Therefore, it is
generally advantageous to consult meta-tags for the weighting process. However, if a Web author abuses
a meta-tag in such a way that it has no semantic relationship to the context of a Web page, these meta-tags
do not help us, e.g., in document classification. As many Web pages contain unrelated keywords that
users of search engines frequently query (Davison, 2000), we use meta-tags for classification
only if the words enclosed in the meta-tags are also contained in the document body text.
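The meta-tag rule above can be sketched as a simple filter. This is a minimal illustration using Python's standard `html.parser`; the class `MetaAndBodyParser` and the function `usable_meta_keywords` are hypothetical names, and the sketch assumes keywords live in `<meta name="keywords" content="...">` as comma-separated single words.

```python
import re
from html.parser import HTMLParser


class MetaAndBodyParser(HTMLParser):
    """Collects meta keywords and the set of words found in the body."""

    def __init__(self):
        super().__init__()
        self.keywords = []
        self.body_words = set()
        self._in_body = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "body":
            self._in_body = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            content = attrs.get("content") or ""
            # Assumption: keywords are comma-separated single words.
            self.keywords += [
                k.strip().lower() for k in content.split(",") if k.strip()
            ]

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False

    def handle_data(self, data):
        if self._in_body:
            self.body_words.update(re.findall(r"[a-z0-9]+", data.lower()))


def usable_meta_keywords(html):
    """Return only those meta keywords that also occur in the body text."""
    parser = MetaAndBodyParser()
    parser.feed(html)
    parser.close()
    return [k for k in parser.keywords if k in parser.body_words]


if __name__ == "__main__":
    page = (
        '<html><head><meta name="keywords" content="cars, Engines, lottery">'
        "</head><body><p>Engines for cars.</p></body></html>"
    )
    print(usable_meta_keywords(page))
```

Multi-word keywords and stemming are deliberately ignored here; a real system would need to decide how to match them against the body text.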
Computation of Weights
We have used a simple “nearest neighbor classifier” to show the performance of our tag weighting
approach. Because of the lack of typographic research mentioned already in Hartley (1986), we have set
up the following rules of thumb for the computation of term weights. During a parser pass over the
document, all HTML tags are replaced with numeric weighting tags, which correspond to the weight
values of Table 4. Column one of Table 4 lists the HTML tags considered. Column two contains the
character font size corresponding to the respective HTML tag. The third column contains the square
root of the number of foreground pixels set in an example sentence that uses the respective tag style. Columns four
and five contain the absolute and relative weights derived from the third column. The relative weight
(column five) for boldface text is derived from the average weight difference between the font size
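
The parser pass described in this section can be sketched as follows. This is only an illustration under stated assumptions: the values in `TAG_WEIGHTS` are hypothetical placeholders (the actual weights of Table 4 are not reproduced here), and letting the heaviest enclosing tag determine the weight of nested text is an assumption of this sketch, not a rule stated in the text.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical relative weights, standing in for the values of Table 4.
TAG_WEIGHTS = {"h1": 3.0, "h2": 2.0, "b": 1.5, "strong": 1.5, "em": 1.2, "i": 1.2}
DEFAULT_WEIGHT = 1.0  # assumed weight of text outside any weighted tag


class WeightedTermParser(HTMLParser):
    """Accumulates a weight per term according to the enclosing tags."""

    def __init__(self):
        super().__init__()
        self._stack = []  # currently open weighted tags
        self.term_weights = defaultdict(float)

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self._stack:
            # Pop up to and including the most recent matching open tag.
            while self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        # Assumption: for nested tags, the largest weight applies.
        weight = max(
            (TAG_WEIGHTS[t] for t in self._stack), default=DEFAULT_WEIGHT
        )
        for term in re.findall(r"[a-z0-9]+", data.lower()):
            self.term_weights[term] += weight


def term_weights(html):
    """Return a dict mapping each term to its accumulated weight."""
    parser = WeightedTermParser()
    parser.feed(html)
    parser.close()
    return dict(parser.term_weights)


if __name__ == "__main__":
    print(term_weights("<h1>Pixel weights</h1><p>Plain <b>bold</b> pixel</p>"))
```

The resulting term-weight dictionaries could then serve as document vectors for a nearest neighbor classifier, with a distance or similarity measure chosen separately.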