Information Technology Reference
In-Depth Information
The favored weighting procedure, which combines absolute character size with a relative weighting
of the remaining style characteristics, is performed in two steps. In the first step, an absolute
weight is assigned to each text passage based on its character size. This weight can be assigned
proportionally to the character size or, alternatively, in levels corresponding to the character sizes
actually used. In the second weighting step, the frequency of the remaining style combinations is
counted separately for each character size used. Depending on these combination frequencies, an offset
is added to the weight previously computed from the character size. To prioritize character size, this
offset must be smaller than the difference to the next higher character size weight. In our experiments,
we used half of the difference to the next higher character size weight as the maximum for this offset.
This guarantees that a larger character size always outweighs all other style combinations together.
Instead of increasing the typographical weight linearly with character size, a different weighting
function could be used. In our measurements, we achieved the best results with a weighting function
that increases the weight in proportion to the square of the character size, which confirms our
pixel-based weighting approach, described in the Relevant Information within HTML Tags section.
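The two-step procedure described above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names, the (size, style) passage representation, the rule that rarer style combinations receive a larger offset, and the treatment of the largest size (which has no next higher size) are all hypothetical choices. Only the quadratic base weight and the cap of half the difference to the next higher size weight come from the text.

```python
from collections import Counter

def size_weights(sizes, exponent=2):
    """Step 1: one base weight per distinct character size (size ** exponent)."""
    return {s: float(s) ** exponent for s in sorted(set(sizes))}

def typographic_weights(passages, exponent=2):
    """passages: list of (size, style) tuples. Returns one weight per passage."""
    base = size_weights([size for size, _ in passages], exponent)
    ordered = sorted(base)

    # Difference to the next higher size weight.
    # Assumption: the largest size reuses the difference to the size below it.
    gap = {}
    for i, s in enumerate(ordered):
        if i + 1 < len(ordered):
            gap[s] = base[ordered[i + 1]] - base[s]
        elif i > 0:
            gap[s] = base[s] - base[ordered[i - 1]]
        else:
            gap[s] = base[s]

    # Step 2: count style combinations separately for each character size.
    freq = Counter(passages)
    per_size_total = Counter(size for size, _ in passages)

    weights = []
    for size, style in passages:
        # Assumption: a rarer style combination gets a larger offset.
        rarity = 1.0 - freq[(size, style)] / per_size_total[size]
        # The offset never exceeds half the difference to the next size weight.
        offset = 0.5 * gap[size] * rarity
        weights.append(base[size] + offset)
    return weights

passages = [(10, "normal"), (10, "normal"), (10, "normal"),
            (10, "bold"), (12, "normal")]
weights = typographic_weights(passages)
```

Because the offset is capped at half the difference, every 12 p passage in this example is weighted higher than every 10 p passage, whatever its style.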
The typographic term weighting can be done with a weighting table (see Table 6). During parsing of
the texts, the table entry that corresponds to the typographic style of a parsed term is incremented
for that term. After the parsing process, the weighting table can be computed and numeric
typographical weights can be assigned to the text passages in the document, based on the collected
statistical information and the procedure described previously.
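Collecting the statistics for such a weighting table during parsing might look like the following sketch. The input format (term, set of style attributes, character size) and the helper names are assumptions made for illustration; the text does not specify how the parser reports styles.

```python
from collections import Counter

def style_key(attributes):
    """Canonical row label: {'italic', 'bold'} -> 'bold + italic'; empty -> 'normal'."""
    return " + ".join(sorted(attributes)) if attributes else "normal"

def build_table(parsed_terms):
    """Count terms per (style, size) cell, incrementing one cell per parsed term."""
    table = Counter()
    for _term, attributes, size in parsed_terms:
        table[(style_key(attributes), size)] += 1
    return table

parsed_terms = [
    ("Title", {"bold"}, 14),
    ("intro", set(), 10),
    ("text", set(), 10),
    ("note", {"italic", "bold"}, 10),
]
table = build_table(parsed_terms)
```

Each key of the resulting counter corresponds to one cell of the table: the style combination selects the row and the character size selects the column.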
Figure 1 shows the typographical weighting of the author guidelines of Table 5 as a function of
character size, with the weights normalized to the highest assigned typographical weight. Note
that the sections of the two author guidelines are weighted very differently. Due to the standardization,
the sections and subsections of the ACM author guidelines are not weighted significantly more strongly
than continuous text.
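Normalizing to the highest assigned weight, as done for the figure, can be sketched as below. The character sizes and the quadratic weights in the example are hypothetical and are not taken from Table 5.

```python
def normalize(weights):
    """Scale weights so that the highest assigned weight becomes 1.0."""
    if not weights:
        return {}
    top = max(weights.values())
    return {size: w / top for size, w in weights.items()}

# Hypothetical weights: character size in points -> size squared.
raw = {9: 81.0, 10: 100.0, 18: 324.0}
normalized = normalize(raw)
```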
The character sizes used in a document are also a design element; that is, they depend on the document
design. Consequently, the semantic weight of a character size also depends on the design of the document.
Therefore, a weighting in the form of size levels, based on the character sizes actually used, is
better than an absolute weighting of font sizes. For this purpose, all character sizes used are determined,
sorted in ascending order, and then each assigned, e.g., a weight which is linearly increased
Table 6. Typographical weighting table for the mixed approach

Style (typeface) \ Size       1 p   ...   9 p   10 p   ...   64 p
normal
bold
italic
bold + italic
underlined
...
bold + italic + underlined
...
 