Supporting Text Retrieval by Typographical Term Weighting - Distributed Artificial Intelligence, Agent Technology, and Collaborative Applications


Table 2. Absolute term weighting table by Kim and Zhang

HTML Tag                    Tag Weight
<a>                         1.7634
<h1>, <h2>, <h3>, <h4>      2.3425
<b>                         0.7060
<i>                         1.0192
<title>                     0.5584

Table 3. Absolute term weighting table by Kwon and Lee

HTML Tag                                                  Tag Weight
<title>, <h1>, <meta "keyword">, <meta "description">     4
<b>, <blink>, <h2-7>, <strong>, <u>, <i>, <big>,          3
  <dt>, <dfn>, <caption>, <abstract>, <ul>, <alt>, <a>,
  <strike>, <note>, <q>, <footnote>, <cite>, <era>, <ol>
remaining tags and normal text                            1

sion of their IR system was increased by nearly 7%.

Using a genetic algorithm to learn the tag weights, Kim and Zhang (2000) determined a similar weighting table (Table 2).

Based on measurements with a kNN classifier, Kwon and Lee (2000) determined a weighting table (Table 3) for HTML tags. In combination with feature selection, they reported an improvement of 14.7% in the precision-recall break-even point. However, tag weighting on its own did not improve classification quality. The authors attribute this to noise terms being weighted too strongly.
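For readers unfamiliar with the measure, the precision-recall break-even point is the value at which precision and recall are equal. The following is a minimal sketch of that measure, not the evaluation code of Kwon and Lee; it assumes a single ranked result list with binary relevance labels, and the function name break_even_point is hypothetical. It relies on the fact that precision and recall coincide when the cutoff equals the number of relevant items.

```python
def break_even_point(ranked_relevance):
    """Return the precision-recall break-even point of a ranked list.

    ranked_relevance is a list of 0/1 labels ordered from the highest
    ranked item to the lowest (an assumption of this sketch).
    """
    total_relevant = sum(ranked_relevance)
    if total_relevant == 0:
        # No relevant items: precision and recall are both taken as 0 here.
        return 0.0
    # With a cutoff of total_relevant items, precision and recall share
    # the same numerator (hits) and the same denominator (total_relevant).
    hits = sum(ranked_relevance[:total_relevant])
    return hits / total_relevant
```

For the labels [1, 0, 1, 1, 0, 0] there are three relevant items, two of which appear among the first three results, so both precision and recall are 2/3 at that cutoff.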

Common to all related work on typographic term weighting is that HTML tags are weighted absolutely and that this weight is used as an additional factor in frequency-based term weighting. The tag weighting is based on measurements with test documents and maps HTML tags into a few weighted groups. In all these approaches, a larger font size for text enclosed in HTML tags leads to a stronger weighting of the enclosed text. The disadvantage of these approaches is that the tag weighting rests only on statistical measurements rather than on typographic research, which makes them highly dependent on the training documents used. In contrast, we have suggested an absolute weighting approach (Werner et al., 2005) that weights HTML tags in the way intended by the author of the HTML document.
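The idea of using an absolute tag weight as an additional factor in frequency-based term weighting can be sketched as follows. This is an illustration under stated assumptions, not the implementation of any of the cited authors: it uses a subset of the weights from Table 3, omits the <meta> entries, and assumes that a term nested in several tags receives the largest applicable weight. The names TAG_WEIGHTS, WeightedTermCounter, and weighted_term_frequencies are hypothetical.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Subset of Table 3 (Kwon and Lee); <meta> entries are omitted in this sketch.
TAG_WEIGHTS = {"title": 4, "h1": 4}
TAG_WEIGHTS.update(
    {tag: 3 for tag in ("b", "strong", "u", "i", "big", "a", "h2", "h3", "cite")}
)
DEFAULT_WEIGHT = 1  # remaining tags and normal text


class WeightedTermCounter(HTMLParser):
    """Accumulates term frequencies scaled by the enclosing tag weight."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.weighted_tf = defaultdict(float)

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the most recent matching tag; tolerate unbalanced markup.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i:]
                break

    def handle_data(self, data):
        # Assumption: nested tags do not multiply; the largest weight wins.
        weight = max(
            (TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.open_tags),
            default=DEFAULT_WEIGHT,
        )
        for term in re.findall(r"[a-z0-9]+", data.lower()):
            self.weighted_tf[term] += weight


def weighted_term_frequencies(html):
    """Return a dict mapping each term to its tag-weighted frequency."""
    counter = WeightedTermCounter()
    counter.feed(html)
    counter.close()
    return dict(counter.weighted_tf)
```

In this sketch, a term that occurs once in the title and once in plain text contributes 4 + 1 = 5 to its weighted frequency, whereas an unweighted count would give 2. The resulting values could then replace the raw term frequency in a scheme such as tf-idf.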
