The third module is a user interface acting as a back-office application, allowing
easy interpretation of the classifications produced by the external evaluators. It also
graphically shows the K-Statistics resulting from the evaluations of any two evaluators.
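Assuming the K-Statistic mentioned here is Cohen's Kappa (a common choice for inter-evaluator agreement; the text does not name the exact variant), a minimal sketch of the computation between two evaluators could look like this:

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa: observed agreement between two evaluators,
    corrected for the agreement expected by chance."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical classifications of the same six extracted terms
# by two evaluators, using the categories defined in this section
a = ["G", "G", "NG", "B", "G", "U"]
b = ["G", "NG", "NG", "B", "G", "B"]
print(round(cohen_kappa(a, b), 3))  # → 0.538
```

The evaluator labels above are invented for illustration only; they are not data from the paper's experiment.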
We worked with a collection of texts in the three languages studied,
Portuguese (pt), English (en) and Czech (cs), drawn from European legislation ( http://eur-
lex.europa.eu/[L]/legis/latest/chap16.htm , where [L] can be replaced by any of the
following language codes: pt, en or cs). The texts were about Science,
Dissemination of Information, and Education and Training. The Czech corpus also
included texts about Culture. Apart from these additional Czech texts, the
corpus documents were parallel across the three languages. The total number of terms
in these collections was 109,449 for Portuguese, 100,890 for English and 120,787 for
Czech.
We worked with single words having a minimum length of six characters (this parameter
is configurable) and filtered multi-words (whose component words may have any length)
by removing those containing punctuation marks, numbers or other special symbols. As
will be seen later, some additional filtering operations are required; these are
discussed in the conclusions section.
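The filtering rules just described can be sketched as follows; the function names and the exact regular expression are illustrative assumptions, not the paper's implementation:

```python
import re

MIN_WORD_LEN = 6  # minimum single-word length used in the paper (configurable)

def keep_single_word(word):
    """Keep single words of at least MIN_WORD_LEN alphabetic characters."""
    return len(word) >= MIN_WORD_LEN and word.isalpha()

def keep_multiword(multiword):
    """Discard multi-words containing punctuation marks, digits or other
    special symbols; component words may have any length.
    [^\\W\\d_]+ matches runs of Unicode letters only."""
    return all(re.fullmatch(r"[^\W\d_]+", w) for w in multiword.split())

print(keep_single_word("science"))                     # → True
print(keep_multiword("dissemination of information"))  # → True
print(keep_multiword("article 16.2"))                  # → False
```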
Evaluators were asked to evaluate the 25 best-ranked terms for a selected sub-
group of six of the twenty metrics, over a sub-set of five randomly selected documents
out of a total of 28 documents per language. The evaluators had access to the
original documents from which the key-words were extracted. When the document metadata
contained the keywords assigned to it, evaluators also had access to this information,
which helped the evaluation task. It is worth noting that when this metadata exists, it
generally does not use, mutatis mutandis, the multi-word terms as they appear in the
document. This information was not used in the extraction task performed.
Four classifications were considered in the evaluation: “good topic descriptor” (G),
“near good topic descriptor” (NG), “bad topic descriptor” (B), and “unknown” (U). A
fifth classification, “not evaluated” (NE), was required to enable the propagation of
the evaluation to those metrics that were not specifically evaluated. The results of
the experiments are presented in Section 5.
It must be stressed that the multi-word extractor used is available on the web page
referred to in [18].
4 Metrics Used
As mentioned before, we used four basic metrics: Tf-Idf, Phi-square, Relative Variance
(Rvar) and Mutual Information (MI).
Formally, Tf-Idf for a term t in a document d_j is defined in equations (1), (2) and (3):

\[
\mathrm{Tf\text{-}Idf}(t, d_j) \;=\; p(t, d_j)\cdot \mathrm{Idf}(t) \tag{1}
\]
\[
p(t, d_j) \;=\; \frac{f(t, d_j)}{\sum_{t_i \in d_j} f(t_i, d_j)} \tag{2}
\]
\[
\mathrm{Idf}(t) \;=\; \log\frac{\lVert D\rVert}{\lVert\{\, d_j \in D : t \in d_j \,\}\rVert} \tag{3}
\]
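A minimal sketch of computing Tf-Idf as defined by equations (1)-(3), representing each document as a list of terms; the natural logarithm is assumed here, since the paper does not state the base:

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """Tf-Idf using the probability p(t, d_j) of eq. (2)
    as the term-frequency factor, per eq. (1)."""
    counts = Counter(doc)
    p = counts[term] / sum(counts.values())           # eq. (2)
    df = sum(1 for d in corpus if term in d)          # documents containing t
    idf = math.log(len(corpus) / df) if df else 0.0   # eq. (3)
    return p * idf                                    # eq. (1)

# Toy corpus for illustration only (not the paper's data)
docs = [["european", "legislation", "education"],
        ["education", "training", "education"],
        ["science", "culture"]]
print(round(tf_idf("education", docs[1], docs), 3))   # → 0.27
```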
Notice that, in (1), instead of the usual term frequency factor, the probability
p(t, d_j), defined in equation (2), is used. There, f(t, d_j) denotes the frequency of