Again, Tf-Idf and Phi-Square derived metrics were the best performing. It is also
worth mentioning that the Bubble operator brought some improvements in the results
when applied to the Rvar and MI metrics. It is worth noting that, for Portuguese
and Czech, precision for some metrics increased when we considered the top 10 and even
the top 20 ranked extracted terms rather than only the top 5. For Czech this occurred
for Least Bubbled Median Tf-Idf and Least Bubbled Median Phi-Square. For Portuguese
it was the case for Least Tf-Idf and Least Bubbled Phi-Square.
In future work we will mainly explore the Tf-Idf and Phi-Square metrics and their
derivatives. We must also take greater care over the length of the texts evaluated.
For a large text, an evaluation with the 5, 10 or 20 best ranked terms may make
sense. For smaller texts, however, taking even just the 5 best ranked terms may
negatively affect the mean precision over all documents since, in such cases, the 2 or 3
best ranked terms will probably exhaust the good candidates for document content
descriptors.
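The effect of text length on precision can be illustrated with a minimal sketch. The function name `precision_at_k` and the judgement list below are hypothetical and invented purely for illustration; they are not taken from the evaluation reported here. The sketch assumes that a document's ranked terms have each been judged good or not good.

```python
def precision_at_k(judgements, k):
    """Fraction of the top-k ranked terms that were judged good.

    judgements: list of booleans in rank order (True = good descriptor).
    Dividing by k (not by the number of terms available) is an assumption
    of this sketch; it penalises documents with fewer than k good terms.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(judgements[:k]) / k


# Hypothetical short document: only the 3 best ranked terms are good.
short_doc = [True, True, True, False, False]

p3 = precision_at_k(short_doc, 3)
p5 = precision_at_k(short_doc, 5)
print(p3, p5)
```

Under these assumed judgements the short document scores perfectly at a cutoff of 3 but cannot exceed 3/5 at a cutoff of 5, which is the kind of drop in mean precision described above.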
Concerning human evaluation, we will make an effort to better prepare this phase of
the work, discussing the criteria to be used by the evaluators and making them
explicit, in order to reduce evaluation disagreement.
Regarding the problem identified in Section 5 of multi-words that are not
independent, we must treat it with greater care, knowing that it is not easy to
solve. Take another example of good descriptors extracted using the Phi-Square
metric from document 32006D0688 (in
http://eur-lex.europa.eu/en/legis/latest/chap1620.htm). Below are the terms
classified as good:
asylum
asylum and immigration
immigration
areas of asylum and immigration
areas of asylum
national asylum
If we filtered out multi-words that are sub multi-words of larger multi-words, in the
example above we would get rid of “asylum and immigration” and “areas of
asylum”. But, as can be seen, other filtering possibilities might be used, so this must
be addressed cautiously. As a matter of fact, we are not sure that a long key term
(5-gram) such as “areas of asylum and immigration” is a better descriptor than “asylum
and immigration”. Similarly, for the example shown in Table 1, the multi-word
“group on ethics in science and new technologies”, which might be recaptured by
binding top ranked multi-words having identical extremities, is possibly a good
descriptor. But again some care must be taken. If we want to directly extract longer
multi-words such as “group on ethics in science and new technologies”, we just need
to set a larger maximum multi-word length, although this has a computational cost.
For this work it was fixed at 5.
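The sub multi-word filter discussed above can be sketched as follows. This is only one of the possible filtering strategies, not an implementation used in this work; the function names `is_sub_sequence` and `drop_nested_multiwords` are hypothetical, and the sketch assumes whitespace tokenization and treats single words as never being filtered.

```python
def is_sub_sequence(short, long):
    """True if token list `short` occurs contiguously inside `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def drop_nested_multiwords(terms):
    """Remove multi-words contained in a longer multi-word of the list.

    Single words are always kept (an assumption of this sketch).
    """
    tokenized = [t.split() for t in terms]
    kept = []
    for term, toks in zip(terms, tokenized):
        nested = len(toks) > 1 and any(
            len(other) > len(toks) and is_sub_sequence(toks, other)
            for other in tokenized
        )
        if not nested:
            kept.append(term)
    return kept


# Terms classified as good for document 32006D0688.
good_terms = [
    "asylum",
    "asylum and immigration",
    "immigration",
    "areas of asylum and immigration",
    "areas of asylum",
    "national asylum",
]

filtered = drop_nested_multiwords(good_terms)
print(filtered)
```

On this list the sketch discards “asylum and immigration” and “areas of asylum”, keeping the 5-gram in their place, which is exactly the outcome whose desirability is questioned above.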
Concerning Czech, a stricter evaluation would not accept some of the terms that
were taken as good, as they were case marked and should not be. This will certainly
require some language-dependent filtering tool, which is more complex than simple
lemmatization of words.