Table 7. Kappa statistics-based agreement between the evaluators, for Portuguese and English,
for Rvar and MI based metrics

Metric      Portuguese   English
LBM Rvar    0.28         0.24
LM Rvar     0.27         0.28
LBM MI      0.07         0.28
LM MI       0.19         0.22
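
The agreement figures in Table 7 are kappa statistics. As a minimal sketch of how such a value can be computed, the code below implements Cohen's kappa for two evaluators. This is an assumption: the text does not say which kappa variant was used or how many evaluators took part. The labels and judgments in the example are hypothetical and are not the data behind Table 7.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Observed agreement: fraction of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical judgments of eight extracted key-terms (illustration only).
evaluator_1 = ["good", "good", "bad", "good", "bad", "bad", "good", "bad"]
evaluator_2 = ["good", "bad", "bad", "good", "good", "bad", "good", "good"]
kappa = cohen_kappa(evaluator_1, evaluator_2)
print(round(kappa, 2))
```

Here the evaluators agree on 5 of 8 items (0.625) while chance agreement is 0.5, so kappa is 0.25.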
“groups on ethics in science” that is not present in the Portuguese and English
versions of the same text. Similarly, “etiku ve vědě” is the accusative case of “etika
ve vědě”. The results obtained nevertheless give a clear idea of the content of the
document. Evaluation for languages such as Czech, and for other languages whose word
forms are modified by case, still needs to be discussed in depth, or may require a
post-extraction normalizer to bring phrases to the nominative case.
6 Conclusions and Future Work
Our approach to the key-term extraction problem (for both words and multi-words) is
language independent.
We ranked words and multi-words separately, using 20 metrics built on 4 base
metrics, namely Tf-Idf, Phi Square, Rvar (relative variance) and MI (Mutual
Information). We then merged the list of top-ranked words with the list of top-ranked
multi-words, taking into account the values assigned to each word and multi-word by
each of the metrics tested. In this way we made no discrimination between words and
multi-words, as both kinds of entity pass through the same sieve of metrics to be
ranked as adequate key-terms. By comparing the 12 metrics that take into account only
the word and multi-word based document representation, we could conclude that the
Tf-Idf and Phi Square based metrics yielded better precision and recall than the
Rvar and MI based metrics, which tend to extract rare terms.
This contradicts the results obtained by [1,2].
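
The rank-then-merge idea described above can be sketched as follows. This is only an illustration under stated assumptions: it uses plain Tf-Idf (not the derived metric variants, which are not defined in this section), treats multi-words as contiguous two-word sequences, and uses a hypothetical three-document corpus. All function names are invented for the example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Contiguous n-token terms: n=1 gives words, n=2 gives two-word multi-words."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(doc_index, corpus, n):
    """Plain Tf-Idf score of every n-token term in corpus[doc_index]."""
    term_lists = [ngrams(tokens, n) for tokens in corpus]
    counts = Counter(term_lists[doc_index])
    total = sum(counts.values())
    doc_sets = [set(terms) for terms in term_lists]
    scores = {}
    for term, count in counts.items():
        # Document frequency is at least 1, because the document is in the corpus.
        df = sum(term in terms for terms in doc_sets)
        scores[term] = (count / total) * math.log(len(corpus) / df)
    return scores


def top(scores, k):
    """The k best-scored terms; ties are broken alphabetically."""
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


def merged_key_terms(doc_index, corpus, k):
    """Rank words and multi-words separately, then merge both lists by score."""
    words = top(tf_idf(doc_index, corpus, 1), k)
    multi_words = top(tf_idf(doc_index, corpus, 2), k)
    return sorted(words + multi_words, key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy corpus (illustration only).
corpus = [
    "ethics in science needs ethics committees".split(),
    "science funding in europe".split(),
    "funding in europe grows".split(),
]
key_terms = merged_key_terms(0, corpus, k=3)
for term, score in key_terms:
    print(f"{score:.3f}  {term}")
```

Because both lists are scored by the same metric and merged by value, a multi-word such as "ethics committees" can outrank a single word such as "committees" without any special treatment of either kind of term.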
As we wanted to extend our methodology to morphologically rich languages, we
introduced another document representation in terms of word prefixes, and in that way
corroborated the conclusions of [3], where the Bubbled variants
showed interesting results for the morphologically rich languages tested, especially
for Portuguese.
This other representation led us to use 8 metrics based on the same 4
base metrics already mentioned. Experiments were carried out for Portuguese, English
and Czech. For Portuguese, the higher precision values were obtained using two of the
metrics designed to handle the prefix, word and multi-word representation. For Czech,
and even for English, the results were less striking but deserve further attention.
In fact, the second-best precision for the 5 top-ranked key-term candidates,
both for Czech and for English, was obtained using the Least Bubble Tf-Idf metric.
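
Precision for the top-ranked candidates, as referred to above, can be illustrated with a short sketch. It assumes the usual definition of precision at k (the fraction of the first k ranked terms that evaluators judged adequate). The exact evaluation protocol is not described in this section, and the ranked list and judgments below are hypothetical.

```python
def precision_at_k(ranked_terms, relevant_terms, k=5):
    """Fraction of the k top-ranked terms that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(term in relevant_terms for term in ranked_terms[:k])
    return hits / k


# Hypothetical ranking and evaluator judgments (illustration only).
ranked = ["ethics in science", "science", "committee", "europe", "year", "funding"]
relevant = {"ethics in science", "science", "europe"}
p_at_5 = precision_at_k(ranked, relevant, k=5)
print(p_at_5)
```

Three of the first five candidates are in the relevant set, so the precision at 5 is 0.6.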