Language Independent Extraction of Key Terms: An Extensive Comparison of Metrics - Agents and Artificial Intelligence

metrics Tf-Idf, Rvar, Phi Square, and MI as depicted in equation (9), when a multi-word MW
is at stake. As above, the Least_MT of a multi-word is the minimum of the MT metric
values of its extremity words, w_1 or w_n, in the multi-word MW = (w_1 ... w_n). This
operator was adapted to work with single words as in equation (8), where the
Least_MT of a word is identical to the rank value obtained for that word using
metric MT. The adaptation assumes that the leftmost and rightmost words
of a single word coincide with the word itself.

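$$\mathrm{Least\_MT}(W) = MT(W) \tag{8}$$

$$\mathrm{Least\_MT}(MW) = \min\big(MT(w_1),\, MT(w_n)\big), \quad MW = (w_1 \ldots w_n) \tag{9}$$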
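As a minimal sketch of the Least operator described above, the following Python function takes the minimum base-metric score of the two extremity words of a term. The function name `least`, the `mt_scores` dictionary and its values are illustrative assumptions, not taken from the chapter; splitting on whitespace is also a simplification.

```python
from typing import Mapping


def least(term: str, score: Mapping[str, float]) -> float:
    """Least operator: minimum base-metric score of the two extremity words."""
    words = term.split()
    # For a single word, words[0] and words[-1] are the same word,
    # so the result is simply that word's own score.
    return min(score[words[0]], score[words[-1]])


# Hypothetical scores from some base metric MT (illustrative values only).
mt_scores = {"information": 0.8, "of": 0.1, "dissemination": 0.6}

single = least("information", mt_scores)
multi = least("dissemination of information", mt_scores)
```

Note that the inner word "of" does not influence the result: only the leftmost and rightmost words are consulted.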

Bubbled Operator. Another problem we needed to solve was the propagation of the
relevance of each prefix (P) to the words (W) having P as a prefix.

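$$\mathrm{Bubble\_MT}(W) = MT(P), \quad P \text{ a prefix of } W \tag{10}$$

$$\mathrm{Least\_Bubble\_MT}(MW) = \min\big(\mathrm{Bubble\_MT}(w_1),\, \mathrm{Bubble\_MT}(w_n)\big) \tag{11}$$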

In bubble-based metrics, the rank of a prefix is assigned to the words it prefixes.
Generally it is larger than the rank assigned by the corresponding metric to the word
forms it prefixes. For example, the value assigned to the 5-character prefix “techn” in
a text would be propagated to all words having that prefix, namely “technology”,
“technologies”, “techniques”, if they appear in the same text.
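The propagation from prefix to word can be sketched as follows. This is only an illustration under stated assumptions: the prefix scores are supplied in a hypothetical `prefix_scores` dictionary, the prefix length is fixed at 5 characters to match the “techn” example, and `least_bubble` combines the Bubble and Least operators in the way the name Least_Bubble suggests. None of these names or values come from the chapter.

```python
from typing import Mapping


def bubble(word: str, prefix_score: Mapping[str, float], prefix_len: int = 5) -> float:
    """Bubble operator: a word inherits the score of its prefix."""
    # Assumption: the prefix is simply the first prefix_len characters.
    return prefix_score[word[:prefix_len]]


def least_bubble(term: str, prefix_score: Mapping[str, float], prefix_len: int = 5) -> float:
    """Minimum bubbled score of the leftmost and rightmost words of a term."""
    words = term.split()
    return min(
        bubble(words[0], prefix_score, prefix_len),
        bubble(words[-1], prefix_score, prefix_len),
    )


# Hypothetical prefix scores (illustrative values only).
prefix_scores = {"techn": 0.9, "infor": 0.7}

# All three word forms share the prefix "techn" and receive the same score.
shared = [bubble(w, prefix_scores) for w in ("technology", "technologies", "techniques")]
lb = least_bubble("information technologies", prefix_scores)
```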

Median Operator. This operator was added in order to better compare the effects of using an
operator similar to the one proposed in [2], which took into account the median character
length of words in multi-words. This yields the metrics defined in equations (12)
and (13), where T represents a term (word or multi-word), LM stands for the Least_Median
operator applied to any base metric MT, and LBM stands for the Least_Bubble_Median
operator applied to metric MT. The Median of a term T is the median of the character
lengths of the words in a multi-word, or the length of the word at stake. For example,
for a multi-word made of three words of lengths 5, 2 and 6, the median length is 5.

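$$\mathrm{LM\_MT}(T) = \mathrm{Least\_MT}(T) \times \mathrm{Median}(T) \tag{12}$$

$$\mathrm{LBM\_MT}(T) = \mathrm{Least\_Bubble\_MT}(T) \times \mathrm{Median}(T) \tag{13}$$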
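The median computation from the example above (lengths 5, 2 and 6) can be checked with a short sketch. The combination step is an assumption: here the Least value is multiplied by the median length, which is one plausible reading of the Least_Median operator. The function names, the sample term and the scores are hypothetical.

```python
from statistics import median
from typing import Mapping


def median_length(term: str) -> float:
    """Median of the character lengths of the words in a term."""
    return median(len(word) for word in term.split())


def least_median(term: str, score: Mapping[str, float]) -> float:
    """Assumed Least_Median: Least value weighted by the median word length."""
    words = term.split()
    least = min(score[words[0]], score[words[-1]])
    # Assumption: the two factors are combined by multiplication.
    return least * median_length(term)


# Hypothetical base-metric scores (illustrative values only).
mt_scores = {"rates": 0.4, "of": 0.05, "change": 0.5}

m = median_length("rates of change")  # word lengths are 5, 2 and 6
lm = least_median("rates of change", mt_scores)
```

Under this reading, terms built from longer words are promoted, while terms dominated by short function words are demoted.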

5 Results

In this section we present some of the results obtained. We will also show that Rvar
and its related metrics perform worse than those based on Tf-Idf and Phi Square,
contradicting the results presented in [1].

An example of the top five terms extracted from one document, ranked by the Phi-Square
metric for each of the languages studied, is shown in Table 1. This document was
about scientific and technical information and documentation and ethics.

Because the corpus as a whole dealt with science, information dissemination, education
and training, the word “science” alone was naturally demoted for the example
document.
