Language Independent Extraction of Key Terms: An Extensive Comparison of Metrics - Agents and Artificial Intelligence

carefully at the examples shown in [1], where plain Tf-Idf metric is used, it became

apparent that the comparison made between the two metrics was unfair. A fair

comparison would require the use of a Tf-Idf derived metric taking into account the

Tf-Idf values for multi-word extremities as well as the median character length per word

of each multi-word, as had been proposed for the Rvar variant metric in [1].

Moreover, as one needs to calculate the relevance of words and multi-words using the

same metrics, we decided to rank words and multi-words together when describing

the content of any document in a collection, according to the value the metric assigns to

that word or multi-word. This diverges from the proposal made in [2], where an “a

priori” fixed proportion of words and multi-words is required. However, no one knows “a

priori” which documents are better described by words alone or by multi-words, nor

what the best proportion of key words and key multi-words is.
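
The joint ranking described above can be illustrated with a minimal sketch. It assumes plain Tf-Idf with relative term frequency and idf = log(N/df), whitespace tokenization, and multi-words limited to contiguous word pairs; the paper defines its own metrics in section 4, so the formula and the names `terms` and `rank_terms` are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter

def terms(tokens, max_n=2):
    """Yield every word and every contiguous multi-word of up to max_n words."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def rank_terms(docs, doc_index, max_n=2):
    """Rank words and multi-words of one document in a single list.

    Assumption: score = (relative frequency in the document) * log(N / df).
    No fixed proportion of words versus multi-words is imposed; the score
    alone decides the order.
    """
    counts = [Counter(terms(d.lower().split(), max_n)) for d in docs]
    target = counts[doc_index]
    total = sum(target.values())
    scores = {}
    for term, freq in target.items():
        df = sum(1 for c in counts if term in c)  # documents containing term
        scores[term] = (freq / total) * math.log(len(docs) / df)
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

docs = [
    "key terms extraction key terms",
    "document classification methods",
    "feature selection methods",
]
ranked = rank_terms(docs, 0)
```

Because words and multi-words share one score scale, a multi-word such as "key terms" can outrank or trail single words depending only on its score.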

In this way, our work advances the discussion started in [1] and continued in [2], but

we arrive at different conclusions, namely that Tf-Idf and Phi-square based metrics

enabled higher precision and recall for the extraction of document key terms. The use

of a prefix-based representation of documents enabled a significant improvement in precision for

documents written in Portuguese and a minor improvement for Czech, both

representatives of morphologically rich languages.

Additionally, we extend the preliminary discussion started in [3], where some

of the metrics used in the current work were presented. To achieve our aims, we compare

results obtained by using four basic metrics (Tf-Idf, Phi-square, Mutual Information

and Relative Variance) and derived metrics that take into account the median character

length per word of words and multi-words and give specific attention to the word

extremities of multi-words and of words (where the left and right extremities of a word

are considered to be identical to the word itself). This led to a first experiment

in which we compare 12 metrics (3 variants of each of the 4 metrics). In a second experiment, we

decided to use a different document representation in terms of word prefixes of 5

characters in order to tackle morphologically rich languages. As it would be senseless

to evaluate the relevance of prefixes, it became necessary to project (bubble) prefix

relevance into words and into multi-words.
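
A minimal sketch of the prefix-based representation follows. The 5-character prefix length comes from the text; the projection rule is an assumption made only for illustration (a word inherits the score of its prefix, and a multi-word takes the mean of its words' prefix scores), since this section does not specify how the projection is computed. The names `prefix`, `to_prefixes` and `project_scores` are hypothetical.

```python
from statistics import mean

PREFIX_LEN = 5  # prefix length stated in the text

def prefix(word, n=PREFIX_LEN):
    """Return the first n characters of a word (the whole word if shorter)."""
    return word[:n]

def to_prefixes(tokens):
    """Represent a tokenized document by word prefixes instead of words."""
    return [prefix(t) for t in tokens]

def project_scores(prefix_scores, candidate_terms):
    """Project ("bubble") prefix relevance onto words and multi-words.

    Assumption (not from the paper): a word inherits its prefix's score and
    a multi-word receives the mean score of its words' prefixes. Prefixes
    without a score count as 0.0.
    """
    projected = {}
    for term in candidate_terms:
        words = term.split()
        projected[term] = mean(prefix_scores.get(prefix(w), 0.0) for w in words)
    return projected

# Inflected forms collapse onto one prefix, which is the point of the
# representation for morphologically rich languages.
tokens = ["documento", "documentos", "classificados"]
prefixes = to_prefixes(tokens)

prefix_scores = {"docum": 0.8, "class": 0.2}  # made-up relevance values
projected = project_scores(prefix_scores, ["documentos", "documentos classificados"])
```

Other projection rules (for example the maximum or minimum over the words of a multi-word) would fit the same interface.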

All the experimental results were manually evaluated and agreement between

evaluators was assessed using κ-statistics.
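
As an illustration of how agreement between evaluators can be quantified, the sketch below computes Cohen's kappa for two evaluators. This section does not say which kappa variant was used or how many evaluators took part, so the choice of Cohen's kappa and the function name `cohen_kappa` are assumptions.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two evaluators judging the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each evaluator's
    label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both evaluators must label the same non-empty item list")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both evaluators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)

# 1 = term judged a good key term, 0 = not (made-up judgments).
evaluator_1 = [1, 1, 1, 0]
evaluator_2 = [1, 1, 0, 0]
kappa = cohen_kappa(evaluator_1, evaluator_2)
```

A value of 1 means perfect agreement, and a value near 0 means agreement no better than chance.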

This paper is structured as follows: related work is summarized in section 2; our

system, the data and the experimental procedures used are described in section 3; the

metrics used are defined in section 4; results obtained are shown in section 5;

conclusions and future work are discussed in section 6.

2 Related Work

In the area of document classification it is necessary to select features that later will

be used for training new classifiers and for classifying new documents. This feature

selection task is somewhat related to the extraction of key terms addressed in this

paper. In [4], a rather complete overview of the main metrics used for feature

selection for document classification and clustering is made.
