Figure 5. Evaluation of the “ACM data set” (curves: kNN, rel. kNN, rel. kNN + typo.; x-axis: features; y-axis in percent)
The total improvement of our approach can be seen by comparing the bold curve with the thin curve
in Figures 3, 4, and 5. The classification quality of our typography-weighted kNN classifier is up to
6% better than that of a purely frequency-based kNN classifier across all test collections. A further
interesting improvement can be seen in Figure 4: the bold curve shows that our typography-weighted kNN
classifier reaches saturation substantially earlier than the purely frequency-based kNN classifier,
represented by the thin line. That is, our approach achieves the same classification quality as the
purely frequency-based kNN classifier with at most 30% of the features. We regard this significant
performance improvement as one of the major advantages of our improved typography-sensitive term weighting.
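The contrast between frequency-based and typography-weighted term weighting can be illustrated with a minimal sketch. It assumes that each token carries a style tag and that emphasized occurrences count more than plain ones. The weight values in EMPHASIS_WEIGHT, the style names, and the function names are hypothetical and are not taken from the chapter; the cosine-similarity kNN with majority voting is a common textbook variant, not necessarily the exact classifier evaluated above.

```python
import math
from collections import Counter

# Hypothetical emphasis weights (assumption): the chapter's actual
# weighting scheme is not reproduced in this excerpt.
EMPHASIS_WEIGHT = {"plain": 1.0, "bold": 2.0, "heading": 3.0, "title": 4.0}


def weighted_vector(tokens):
    """Build a term vector from (term, style) pairs.

    A purely frequency-based vector would add 1 per occurrence;
    here each occurrence adds the weight of its typographic style.
    """
    vec = Counter()
    for term, style in tokens:
        vec[term.lower()] += EMPHASIS_WEIGHT.get(style, 1.0)
    return dict(vec)


def cosine(a, b):
    """Cosine similarity between two sparse term vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(query, training, k=3):
    """Return the majority label among the k most similar training vectors.

    training is a list of (vector, label) pairs.
    """
    ranked = sorted(training, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

With these assumed weights, a term that appears once in the title and once in plain text receives a weight of 5.0 instead of the raw frequency 2, so emphasized terms dominate the similarity computation even when fewer features are retained.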
Additionally, the dashed curve shows the improvement achieved by the described relative
feature selection and the relative selection of features per category alone. Figures 3 and 4, however,
show that these approaches lead to a better overall result when combined with typographical weighting.
Figure 5 also shows that typographical weighting does not always achieve an improvement. In the case of
the ACM publications, this is because they contain only sparse typographic information. For example,
often only the title and the section headings are emphasized, and some of these headings are
filtered out by the feature selection because they occur so frequently, e.g., “abstract,” “introduction,”
“related work,” “summary,” “conclusion,” “references,” etc.
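Why emphasized headings contribute so little for the ACM publications can be sketched with a simple document-frequency cutoff. This is only a stand-in for the chapter's relative feature selection, whose details are not part of this excerpt; the function name drop_frequent_terms and the threshold max_doc_ratio are assumptions made for illustration.

```python
from collections import Counter


def drop_frequent_terms(documents, max_doc_ratio=0.8):
    """Return the set of terms that survive a document-frequency cutoff.

    documents is a list of term lists. A term is dropped when it occurs
    in more than max_doc_ratio of all documents (hypothetical threshold).
    """
    doc_freq = Counter()
    for terms in documents:
        doc_freq.update(set(terms))  # count each term once per document
    limit = max_doc_ratio * len(documents)
    return {term for term, df in doc_freq.items() if df <= limit}
```

In a collection where every paper contains the headings “abstract” and “introduction,” both terms exceed the cutoff and are removed, so any typographic emphasis on them cannot influence the classifier.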
Summary and Conclusion
Content-based classification of text documents is essential for intelligent document management
systems. The crucial aspect of different classification algorithms for text documents is the classification
quality, which can be improved if the textual content is enriched with further information. However, for
the majority of text documents, enriched XML versions of the documents are not available. Therefore,
typographic information is the only additional information that these text documents offer beyond the
text itself. We have presented two typographical weighting approaches, and we have implemented and