The third module is a user interface acting as a back-office application, allowing
easy interpretation of the classifications produced by the external evaluators. It also
graphically shows the K-Statistics resulting from the evaluations of any two evaluators.
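Assuming the K-Statistic mentioned here is Cohen's Kappa (a common choice for inter-evaluator agreement; the text does not name the exact variant), a minimal sketch of the computation between two evaluators could look like this:

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa: observed agreement between two evaluators,
    corrected for the agreement expected by chance."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical classifications of the same six extracted terms
# by two evaluators, using the categories defined in this section
a = ["G", "G", "NG", "B", "G", "U"]
b = ["G", "NG", "NG", "B", "G", "B"]
print(round(cohen_kappa(a, b), 3))  # → 0.538
```

The evaluator labels above are invented for illustration only; they are not data from the paper's experiment.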
We worked with a collection of texts in the three languages studied,
Portuguese (pt), English (en) and Czech (cs), drawn from European legislation ( http://eur-
lex.europa.eu/[L]/legis/latest/chap16.htm , where [L] can be replaced by any of the
following language codes: pt, en or cs). The texts were about Science,
Dissemination of Information, and Education and Training. The Czech corpus also
included texts about Culture. Apart from these additional Czech texts, the
corpus documents were parallel across the three languages. The total number of terms
in these collections was 109,449 for Portuguese, 100,890 for English and 120,787 for
Czech.
We worked with single words having a minimum length of six characters (this parameter
is configurable) and filtered multi-words (whose component words may have any length)
by removing those containing punctuation marks, numbers or other special symbols. As
will be seen later, some additional filtering operations are required; these are
discussed in the conclusions section.
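The filtering rules just described can be sketched as follows; the function names and the exact regular expression are illustrative assumptions, not the paper's implementation:

```python
import re

MIN_WORD_LEN = 6  # minimum single-word length used in the paper (configurable)

def keep_single_word(word):
    """Keep single words of at least MIN_WORD_LEN alphabetic characters."""
    return len(word) >= MIN_WORD_LEN and word.isalpha()

def keep_multiword(multiword):
    """Discard multi-words containing punctuation marks, digits or other
    special symbols; component words may have any length.
    [^\\W\\d_]+ matches runs of Unicode letters only."""
    return all(re.fullmatch(r"[^\W\d_]+", w) for w in multiword.split())

print(keep_single_word("science"))                     # → True
print(keep_multiword("dissemination of information"))  # → True
print(keep_multiword("article 16.2"))                  # → False
```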
Evaluators were asked to evaluate the 25 best-ranked terms for a selected sub-
group of six of the twenty metrics, over a sub-set of five randomly selected documents
out of a total of 28 documents per language. The evaluators had access to the
original documents from which the key-words were extracted. When the document metadata
contained the keywords assigned to it, evaluators also had access to this information,
which helped the evaluation task. It is worth noting that when this metadata exists, it
generally does not use, mutatis mutandis, the multi-word terms as they appear in the
document. This information was not used in the extraction task performed.
Four classifications were considered in the evaluation: “good topic descriptor” (G),
“near good topic descriptor” (NG), “bad topic descriptor” (B), and “unknown” (U). A
fifth classification, “not evaluated” (NE), was required to enable the propagation of
the evaluation to those metrics that were not specifically evaluated. The results of
the experiments are presented in Section 5.
It must be stressed that the multi-word extractor used is available on the web page
referred to in [18].
4 Metrics Used
As mentioned before, we used four basic metrics: Tf-Idf, Phi-square, Relative Variance
(Rvar) and Mutual Information (MI).
Formally, Tf-Idf for a term t in a document d_j is defined in equations (1), (2) and (3):

\[
\mathrm{Tf\text{-}Idf}(t, d_j) \;=\; p(t, d_j)\cdot \mathrm{Idf}(t) \tag{1}
\]
\[
p(t, d_j) \;=\; \frac{f(t, d_j)}{\sum_{t_i \in d_j} f(t_i, d_j)} \tag{2}
\]
\[
\mathrm{Idf}(t) \;=\; \log\frac{\lVert D\rVert}{\lVert\{\, d_j \in D : t \in d_j \,\}\rVert} \tag{3}
\]
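A minimal sketch of computing Tf-Idf as defined by equations (1)-(3), representing each document as a list of terms; the natural logarithm is assumed here, since the paper does not state the base:

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """Tf-Idf using the probability p(t, d_j) of eq. (2)
    as the term-frequency factor, per eq. (1)."""
    counts = Counter(doc)
    p = counts[term] / sum(counts.values())           # eq. (2)
    df = sum(1 for d in corpus if term in d)          # documents containing t
    idf = math.log(len(corpus) / df) if df else 0.0   # eq. (3)
    return p * idf                                    # eq. (1)

# Toy corpus for illustration only (not the paper's data)
docs = [["european", "legislation", "education"],
        ["education", "training", "education"],
        ["science", "culture"]]
print(round(tf_idf("education", docs[1], docs), 3))   # → 0.27
```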
Notice that, in (1), instead of the usual term frequency factor, the probability
p(t, d_j), defined in equation (2), is used. There, f(t, d_j) denotes the frequency of