metrics Tf-Idf, Rvar, Phi Square, and MI, as depicted in equation (9), when a multiword MW
is at stake. As above, the Least_MT of a multiword is equal to the minimum of
the MT metric values of its extremity words, W_1 or W_n, that is, the leftmost or
the rightmost word of the multiword MW. This operator was adapted to work with
single words as in equation (8), where the Least_MT of a word W is identical to the
rank value obtained for that word using metric MT. The adaptation was made by
assuming that the leftmost and rightmost words of a single word coincide with the
word itself.
$$\mathrm{Least\_MT}(W) = MT(W) \qquad (8)$$
$$\mathrm{Least\_MT}(MW) = \min\bigl(MT(W_1),\, MT(W_n)\bigr) \qquad (9)$$
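The Least operator described above can be illustrated with a minimal Python sketch. It assumes that base-metric values are available as a word-to-score mapping and that a term is a whitespace-separated string; the function name `least_mt` and the scores are hypothetical, chosen only for illustration.

```python
from typing import Mapping


def least_mt(term: str, mt: Mapping[str, float]) -> float:
    """Least operator over a base metric MT (sketch, not the authors' code).

    `mt` maps each word to its base-metric value (an assumed representation).
    """
    words = term.split()
    # For a single word, the leftmost and rightmost words coincide with the
    # word itself, so the result is simply MT(W), as in equation (8).
    # For a multiword, only the two extremity words are considered, as in (9).
    return min(mt[words[0]], mt[words[-1]])


# Hypothetical base-metric values, for illustration only.
scores = {"dissemination": 0.30, "of": 0.01, "information": 0.42, "science": 0.17}

print(least_mt("science", scores))                       # 0.17
print(least_mt("dissemination of information", scores))  # 0.3
```

Note that the inner word "of" does not influence the result, since only the extremity words enter the minimum.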
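Bubbled Operator. Another problem we needed to solve was the propagation of the
relevance of each prefix (P) to the words (W) having P as a prefix. In equation (11),
the Least operator is applied to the bubbled values of the extremity words W_1 and
W_n of a multiword MW.
$$\mathrm{Bubble\_MT}(W) = MT(P) \qquad (10)$$
$$\mathrm{Least\_Bubble\_MT}(MW) = \min\bigl(\mathrm{Bubble\_MT}(W_1),\, \mathrm{Bubble\_MT}(W_n)\bigr) \qquad (11)$$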
In bubble-based metrics, the rank of a prefix is assigned to the words it prefixes.
This rank is generally larger than the rank that the corresponding metric assigns to
the individual word forms. For example, the value assigned to the 5-character prefix
“techn” in a text would be propagated to all words having that prefix, such as
“technology”, “technologies” and “techniques”, if they appear in the same text.
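The prefix propagation just described can be sketched in Python as follows. This is only an illustration under stated assumptions: prefixes are taken to be the first 5 characters of a word (as in the “techn” example), prefix scores are given as a mapping, and the names `bubble_mt`, `least_bubble_mt` and `PREFIX_LEN`, as well as the scores, are hypothetical.

```python
from typing import Mapping

PREFIX_LEN = 5  # assumption: 5-character prefixes, as in the "techn" example


def bubble_mt(word: str, prefix_mt: Mapping[str, float],
              prefix_len: int = PREFIX_LEN) -> float:
    """Assign to a word the base-metric value of its prefix (sketch)."""
    # Words shorter than prefix_len are looked up as their own prefix,
    # which is an assumption of this sketch.
    return prefix_mt[word[:prefix_len]]


def least_bubble_mt(term: str, prefix_mt: Mapping[str, float],
                    prefix_len: int = PREFIX_LEN) -> float:
    """Minimum of the bubbled values of the leftmost and rightmost words."""
    words = term.split()
    return min(bubble_mt(words[0], prefix_mt, prefix_len),
               bubble_mt(words[-1], prefix_mt, prefix_len))


# Hypothetical prefix scores, for illustration only.
prefix_scores = {"techn": 0.8, "infor": 0.5}

# All three word forms inherit the value of the shared prefix "techn".
for word in ("technology", "technologies", "techniques"):
    print(word, bubble_mt(word, prefix_scores))

print(least_bubble_mt("information technologies", prefix_scores))  # 0.5
```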
The Median Operator was added in order to better compare the effects of using an
operator similar to the one proposed in [2], which took into account the median character
length of the words in multiwords. This yields the metrics defined in equations (12)
and (13), where T represents a term (word or multiword), LM stands for the Least_Median
operator applied to any base metric MT, and LBM stands for the Least_Bubble_Median
operator applied to metric MT. The Median of a term T is the median of the character
lengths of the words in a multiword, or the length of the word itself for a single
word. For example, for a multiword made of three words of lengths 5, 2 and 6, the
median length is 5.
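$$\mathrm{LM\_MT}(T) = \mathrm{Least\_MT}(T) \cdot \mathrm{Median}(T) \qquad (12)$$
$$\mathrm{LBM\_MT}(T) = \mathrm{Least\_Bubble\_MT}(T) \cdot \mathrm{Median}(T) \qquad (13)$$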
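The median computation and its use in a Least_Median style score can be sketched in Python. The sketch assumes that the median length multiplies the Least value of the term; that combination, the function names `median_length` and `least_median`, and the example value 0.2 are assumptions made for illustration only.

```python
from statistics import median


def median_length(term: str) -> float:
    """Median of the character lengths of the words in a term."""
    return median(len(word) for word in term.split())


def least_median(term: str, least_value: float) -> float:
    """Combine a Least (or Least_Bubble) value with the median word length.

    Assumption of this sketch: the two quantities are multiplied.
    """
    return least_value * median_length(term)


# A multiword whose words have lengths 5, 2 and 6: the median length is 5.
example = "terms of growth"
print(median_length(example))  # 5

# Hypothetical Least value of 0.2 for the same term.
print(least_median(example, 0.2))
```

Passing the Least value as an argument keeps the sketch independent of how the base metric and the Least or Least_Bubble operators are computed.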
5 Results
In this section we present some of the results obtained. We also show that Rvar
and its related metrics perform worse than those based on Tf-Idf and Phi Square,
contradicting the results presented in [1].
Table 1 shows an example of the top five terms extracted from one document, ranked
by the Phi-Square metric, for each of the languages we worked with. This document
was about scientific and technical information and documentation, and about ethics.
As the corpus used dealt with science, information dissemination, education
and training, the word “science” alone was naturally demoted for the example
document.