It is also important to note that the documents were in many cases short. This has a direct impact on the results: the number of relevant words and multi-words is small, and most candidates are irrelevant to the document content. As a consequence, the precision obtained for shorter documents is lower than for longer documents, since most of the time just one term describes the document content. Longer documents do not pose this problem.
Table 1. Top terms ranked by the Phi-Square metric, manually classified as Good (G), Near Good (NG) or Bad (B), for 3 languages, for a document on scientific and technical information and documentation.

Portuguese                          | English                          | Czech
ciências e as novas tecnologias (G) | group on ethics (G)              | skupiny pro etiku ve vědě (G)
ciências e as novas (B)             | ethics (G)                       | nových technologiích (NG)
ética para as ciências (G)          | science and new technologies (G) | etiku ve vědě (G)
grupo europeu de ética (G)          | the ege (G)                      | skupiny pro etiku (G)
membros do gee (G)                  | ethics in science (G)            | vědě a nových technologiích (NG)
In the previous table, some top-ranked key terms are sub-terms of others. This has some effect on the results, because such terms are not mutually independent. Looking more carefully, we may also notice longer, more specific multi-words that might be rather sharp descriptors of the document content, as is the case of “group on ethics in science and new technologies”. We will return to this discussion in Section 6.
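Sub-term relations like these can be detected with a simple containment check over the extracted terms; a minimal sketch in Python (the term list is the English column of Table 1):

```python
def subterms(terms):
    """Map each term to the longer terms that contain it as a
    whitespace-delimited sub-term (padding avoids partial-word matches)."""
    out = {}
    for t in terms:
        covers = [u for u in terms if u != t and f" {t} " in f" {u} "]
        if covers:
            out[t] = covers
    return out

english_terms = [
    "group on ethics",
    "ethics",
    "science and new technologies",
    "the ege",
    "ethics in science",
]
print(subterms(english_terms))
# "ethics" is contained in both "group on ethics" and "ethics in science"
```

Terms flagged this way share occurrences with their super-terms, which is why they cannot be treated as mutually independent when scoring.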
For the same document, the best-performing Rvar-based metric (LBM_Rvar; see Table 3) extracted “ethics” only at position 20. Its other top-ranked terms include the names of several European personalities, which amounts to very poor extraction results.
Tables 2 and 3 show the average precision values obtained for the 5, 10 and 20 best-ranked key terms extracted using the different metrics.
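Precision at a cut-off of 5, 10 or 20, as reported in those tables, is the fraction of the top-ranked terms judged Good. A minimal sketch (the judgment list is illustrative, not taken from the experiments):

```python
def precision_at_k(labels, k):
    """Fraction of the k best-ranked terms judged relevant (Good)."""
    return sum(1 for label in labels[:k] if label == "G") / k

# Hypothetical Good/Near Good/Bad judgments for a document's 20 best-ranked terms
doc_labels = ["G", "G", "G", "NG", "G",
              "G", "B", "NG", "B", "G",
              "G", "B", "G", "B", "NG", "B", "G", "B", "B", "B"]
for k in (5, 10, 20):
    print(k, precision_at_k(doc_labels, k))
```

With these illustrative labels, precision is highest at the top-5 cut-off and decreases at 10 and 20, the same pattern the text reports for the real results.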
Regarding the recall values presented in Tables 4 and 5, it is necessary to say that: 1) Tf-Idf, Phi Square and their derived metrics extract very similar key terms; 2) Rvar and MI, alone, are unable to extract key terms because, depending on the length of the documents, the top-ranked 100, 200 or more terms are given equal scores by these metrics; 3) the metrics derived from Rvar and MI extract very similar rare key terms, completely dissimilar from those extracted by Tf-Idf, Phi Square and their derived metrics; 4) by evaluating the 25 best-ranked terms of 6 metrics (Phi Square, Least Tf-Idf, Least Median Rvar, Least Median MI, Least Bubble Median Phi Square and Least Bubble Median Rvar), we obtained 60 to 70 evaluated terms per document.
Recall was determined on the basis of these 60 to 70 evaluated terms, so the recall values presented in Tables 4 and 5 are upper bounds on the real recall values. Table 2 shows the results for the metrics with the best precision for the three languages, all of them above 0.50. Notice that for Portuguese and Czech the average precision is similar. The best results were obtained for the top-ranked 5 terms, with precision decreasing to similar values for the top-ranked 10 and 20 terms. On average, English presents the best results.
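The pooled-recall computation described above can be sketched as follows: the set of Good terms is built only from terms that appeared in some metric's top-25 list, so any Good term missed by every pooled metric is invisible, making the measured recall an upper bound. The pool and judgments below are illustrative:

```python
def pooled_recall(extracted_topk, judged_good):
    """Recall of one metric's top-k list against a pooled set of Good terms.

    Because judged_good only contains terms found in some metric's top-25
    pool, the true set of Good terms may be larger; this value is therefore
    an upper bound on real recall.
    """
    hits = sum(1 for t in extracted_topk if t in judged_good)
    return hits / len(judged_good)

# Illustrative pool of terms judged Good across the 6 metrics' top-25 lists
pool_good = {"group on ethics", "ethics", "science and new technologies",
             "the ege", "ethics in science"}
# Illustrative top-ranked list from one metric
top_terms = ["group on ethics", "ethics", "committee",
             "science and new technologies"]
print(pooled_recall(top_terms, pool_good))
```

Here 3 of the 5 pooled Good terms are recovered, giving a recall of 0.6 against the pool, whatever the true number of Good terms in the document may be.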
Also from Table 2 we can point out that, for Portuguese, the best results were obtained with the metrics Least Bubble Tf-Idf and Least Bubble Median Tf-Idf. This means that the Bubble operator and the prefix representation enabled precision results closer to those obtained for English. Tf-Idf had the best results in Czech, for all thresholds. In English, Least Median Phi Square enabled the best results. Moreover, for the 10 best