Supporting Text Retrieval by Typographical Term Weighting - Distributed Artificial Intelligence, Agent Technology, and Collaborative Applications


Weighting of HTML Tags

Relevant Information within HTML Tags

HTML tags within a document control the way in which the document will be displayed by a browser.

There are in principle three classes of tags: logical, physical, and meta-tags. HTML allows words to

be enclosed either within physical tags (e.g., <i> and </i>, which require the enclosed text to be displayed

in italics) or within logical tags (e.g., <em> and </em>, which specify that the content is to be emphasized).

Meta-tags describe the HTML document and its contents.

Due to this diversity, it is necessary to develop an approach which correctly weights logical and

physical tags in a manner consistent with their semantic relevance. For the definition of a

suitable evaluation basis, we have examined different criteria for evaluating emphasized text, with the

purpose of finding the greatest possible approximation between objectively measured values and the

subjective impression of the weighted emphasis of text passages. Among these criteria, the subjective

impression of different font styles depends on the medium used (monitor, paper, etc.) and turned out to

be too difficult to weight. Similar difficulties arise in weighting the subjective impression of different

colors, which may depend on the medium used, the background color, etc. Even worse, weighting colors

by contrast yields wrong results for signal colors, e.g., a red word among black words on a

white background.

As a result, we found that the number of foreground pixels at a typical screen resolution is a much more

important relevance criterion than the previously mentioned criteria. The number of pixels is even more

important than text height or text width, because it reflects not only font size but also

boldface type. This has motivated us to develop an approach which evaluates the relevance of a text

passage based on the number of foreground pixels set.
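The pixel criterion can be illustrated with a minimal sketch. It assumes the text has already been rendered to a 1-bit bitmap; the tiny glyph bitmaps, the `#`/`.` encoding, and the function names are hypothetical and serve only as illustration. The square root is included because the weight table described later in this section tabulates the square root of the foreground pixel count.

```python
import math

# Hypothetical 1-bit renderings of the same letter at the same font size.
# '#' marks a foreground pixel, '.' a background pixel.
REGULAR_I = [
    ".#.",
    ".#.",
    ".#.",
    ".#.",
    ".#.",
]
BOLD_I = [
    "##.",
    "##.",
    "##.",
    "##.",
    "##.",
]


def foreground_pixels(bitmap):
    """Count the pixels that are set to the foreground color."""
    return sum(row.count("#") for row in bitmap)


def pixel_weight(bitmap):
    """Square root of the foreground pixel count (illustrative helper)."""
    return math.sqrt(foreground_pixels(bitmap))


if __name__ == "__main__":
    for name, glyph in (("regular", REGULAR_I), ("bold", BOLD_I)):
        print(name, len(glyph), foreground_pixels(glyph), pixel_weight(glyph))
```

Both bitmaps have the same height, so a criterion based on text height alone could not tell them apart, whereas the foreground pixel count doubles for the bold variant.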

In addition to the logical and physical HTML tags, which have direct influence on text representation

in the Web browser, there are also some tags which have only a descriptive function. For example,

meta-tags can contain summarizing or supplementary information about the HTML document. Therefore, it is

generally advantageous to consult meta-tags for the weighting process. However, if a Web author abuses

a meta-tag in such a way that it has no semantic relationship to the context of a Web page, these meta-tags

do not help us, e.g., in document classification. As many Web pages contain unrelated keywords that

are frequently queried by users of search engines (Davison, 2000), we use meta-tags for classification

only if the words enclosed in the meta-tags are also contained in the document body text.
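The filtering rule for meta-tags can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that keywords are given in a `<meta name="keywords">` tag as a comma-separated list, that each keyword is a single word, and that a case-insensitive exact match against the body words is sufficient. The class and function names are hypothetical.

```python
import re
from html.parser import HTMLParser


class MetaAndBodyParser(HTMLParser):
    """Collects meta keywords and the set of words found inside <body>."""

    def __init__(self):
        super().__init__()
        self.meta_keywords = []
        self.body_words = set()
        self._in_body = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "body":
            self._in_body = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            content = attrs.get("content") or ""
            # Assumption: keywords are separated by commas.
            self.meta_keywords += [
                k.strip().lower() for k in content.split(",") if k.strip()
            ]

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False

    def handle_data(self, data):
        if self._in_body:
            self.body_words.update(re.findall(r"\w+", data.lower()))


def usable_meta_keywords(html):
    """Return only those meta keywords that also occur in the body text."""
    parser = MetaAndBodyParser()
    parser.feed(html)
    parser.close()
    return [k for k in parser.meta_keywords if k in parser.body_words]


if __name__ == "__main__":
    page = (
        '<html><head><meta name="keywords" '
        'content="retrieval, casino, weighting"></head>'
        "<body><h1>Term weighting</h1><p>Text retrieval on the Web.</p></body></html>"
    )
    print(usable_meta_keywords(page))
```

In this example the keyword "casino" never appears in the body text, so it is discarded before classification.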

Computation of Weights

We have used a simple “nearest neighbor classifier” to show the performance of our tag weighting approach.

Because of the lack of typographic research already noted by Hartley (1986), we have set

up the following rules of thumb for the computation of term weights. During a parser pass over the

document, all HTML tags are replaced with numeric weighting tags, which correspond to the weight

values of Table 4. Column one of Table 4 lists the HTML tags considered. Column two contains the

character font size corresponding to the respective HTML tag. The third column contains the square

root of the number of foreground pixels set in an example sentence that uses the respective tag style. Columns four

and five contain the absolute and relative weights derived from the third column. The relative weight

(column five) for boldface text is derived from the average weight difference between the font size
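The parser pass described above can be sketched as follows. The weights in `TAG_WEIGHTS` are hypothetical placeholders and are not the values of Table 4. Instead of rewriting tags into numeric weighting tags, this sketch accumulates the weight per term directly, and it assumes that text inside nested tags receives the largest weight among its enclosing tags. Both are simplifications made for illustration.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical relative weights for illustration only; NOT the values of Table 4.
TAG_WEIGHTS = {"h1": 3.0, "h2": 2.0, "b": 1.5, "strong": 1.5, "i": 1.2, "em": 1.2}
DEFAULT_WEIGHT = 1.0  # weight of plain body text


class TermWeighter(HTMLParser):
    """Accumulates a weight per term according to the enclosing tags."""

    def __init__(self):
        super().__init__()
        self._open = []  # stack of currently open weighted tags
        self.weights = defaultdict(float)

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open:
            # Close the innermost open occurrence of this tag.
            idx = len(self._open) - 1 - self._open[::-1].index(tag)
            del self._open[idx]

    def handle_data(self, data):
        # Assumption: nested emphasis takes the largest enclosing weight.
        weight = max((TAG_WEIGHTS[t] for t in self._open), default=DEFAULT_WEIGHT)
        for term in re.findall(r"\w+", data.lower()):
            self.weights[term] += weight


def term_weights(html):
    """Return a dict mapping each term to its accumulated weight."""
    parser = TermWeighter()
    parser.feed(html)
    parser.close()
    return dict(parser.weights)


if __name__ == "__main__":
    print(term_weights("<h1>Term weighting</h1><p>Plain <b>bold</b> term</p>"))
```

The resulting term weights could then serve as the feature vector handed to a classifier such as the nearest neighbor classifier mentioned above.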
