Table 2. Absolute term weighting table by Kim and Zhang
HTML Tag                              Tag Weight
<a>                                   1.7634
<h1>, <h2>, <h3>, <h4>, <h5>, <h6>    2.3425
<b>                                   0.7060
<i>                                   1.0192
<title>                               0.5584
Table 3. Absolute term weighting table by Kwon and Lee
HTML Tag                                                  Tag Weight
<title>, <h1>, <meta "keyword">, <meta "description">     4
<b>, <blink>, <h2-7>, <strong>, <u>, <i>, <big>, <dt>,    3
  <dfn>, <caption>, <abstract>, <ul>, <alt>, <a>,
  <strike>, <note>, <q>, <footnote>, <cite>, <era>,
  <ol>, <option>, <role>
remaining tags and normal text                            1
... precision of their IR system was increased by nearly 7%.
Using a genetic algorithm to learn the tag weights, Kim and Zhang (2000) determined a
similar weighting table (Table 2).
Based on measurements with a kNN classifier, Kwon and Lee (2000) determined a weighting table (Table 3)
for HTML tags. In combination with feature selection, they measured an improvement of 14.7% in the
precision-recall break-even point. Tag weighting on its own, however, did not improve classification
quality. The authors attribute this to noise terms being weighted too strongly.
Common to all related work on typographic term weighting is that HTML tags are weighted absolutely
and that this weight is used as an additional factor in frequency-based term weighting. The tag weighting
is based on measurements with test documents and maps HTML tags into a few weighted groups. In all these
approaches, a larger character font size for the text enclosed in HTML tags leads to a stronger weighting
of that text. The disadvantage of these approaches is that the tag weighting is based only on statistical
measurements and not on typographic research, which makes them highly dependent on the training
documents used. In contrast, we have suggested an absolute weighting approach (Werner et al., 2005)
that weights HTML tags in the way intended by the author of the HTML document.
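The scheme shared by the related approaches, in which an absolute tag weight acts as an additional factor
in frequency-based term weighting, can be illustrated with a minimal Python sketch. It uses the weights
from Table 2; the names TAG_WEIGHTS, DEFAULT_WEIGHT, TagWeightedCounter and weighted_term_frequencies are
hypothetical, and three choices are assumptions of this sketch rather than part of any cited work: text
outside the listed tags gets weight 1.0, nested tags are resolved by taking the largest enclosing weight,
and terms are split on whitespace only.

```python
from collections import Counter
from html.parser import HTMLParser

# Tag weights taken from Table 2 (Kim and Zhang).
TAG_WEIGHTS = {
    "a": 1.7634,
    "h1": 2.3425, "h2": 2.3425, "h3": 2.3425,
    "h4": 2.3425, "h5": 2.3425, "h6": 2.3425,
    "b": 0.7060,
    "i": 1.0192,
    "title": 0.5584,
}
# Assumption of this sketch: text outside the listed tags counts with 1.0.
DEFAULT_WEIGHT = 1.0


class TagWeightedCounter(HTMLParser):
    """Accumulates, per term, the sum of the tag weights of its occurrences."""

    def __init__(self):
        super().__init__()
        self.open_tags = []      # stack of currently open tags
        self.scores = Counter()  # term -> tag-weighted frequency

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; this also discards
        # unclosed tags such as <br> that were pushed in between.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Assumption: with nested weighted tags, the largest weight wins.
        weights = [TAG_WEIGHTS[t] for t in self.open_tags if t in TAG_WEIGHTS]
        weight = max(weights, default=DEFAULT_WEIGHT)
        for term in data.lower().split():
            self.scores[term] += weight


def weighted_term_frequencies(html):
    """Return a dict mapping each term to its tag-weighted frequency."""
    parser = TagWeightedCounter()
    parser.feed(html)
    parser.close()
    return dict(parser.scores)
```

Each occurrence of a term contributes its tag weight instead of a plain count of 1, so a term in a
heading adds 2.3425 to its score while the same term in body text adds 1.0. The resulting values could
then take the place of raw term frequencies in a conventional weighting scheme.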