Language Independent Extraction of Key Terms: An Extensive Comparison of Metrics - Agents and Artificial Intelligence

Language Independent Extraction of Key Terms:

An Extensive Comparison of Metrics

Luís F.S. Teixeira 2 , Gabriel P. Lopes 2 , and Rita A. Ribeiro 1

1 CA3-Uninova, Campus FCT/UNL, 2829-516 Caparica, Portugal

2 CITI, Dep. Informática, FCT/UNL, 2829-516 Caparica, Portugal

lst@luisteixeira.org, gpl@fct.unl.pt, rar@uninova.pt

Abstract. In this paper, twenty language-independent, statistically based metrics used for key term extraction from any document collection are compared. Some of those metrics are widely used for this purpose; the others were created recently. Two different document representations are considered in our experiments: one is based on words and multi-words, and the other on word prefixes of fixed length (5 characters in the experiments reported). Prefixes were used to study how morphologically rich languages, namely Portuguese and Czech, behave under this second kind of representation. English is also studied, as an example of a language that is not morphologically rich. Results are manually evaluated, and agreement between evaluators is assessed using k-Statistics. The metrics based on Tf-Idf and Phi-square proved to have higher precision and recall. The use of the prefix-based representation of documents enabled a significant precision improvement for documents written in Portuguese. For Czech, recall also improved.

Keywords: Document keywords, Document topics, Words, Multi-words, Prefixes, Automatic extraction, Suffix arrays.
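
To make the two document representations mentioned in the abstract concrete, the following minimal Python sketch maps the words of a document to fixed-length prefixes (5 characters, as in the abstract) and scores terms with Tf-Idf. It is an illustration under stated assumptions, not the authors' implementation: the function names `to_prefixes` and `tf_idf`, the toy documents, and the textbook Tf-Idf variant (relative term frequency times log(N/df)) are all assumptions, and the paper's own formulation may differ. Multi-words are not covered here.

```python
import math
from collections import Counter

PREFIX_LEN = 5  # fixed prefix length reported in the abstract


def to_prefixes(tokens, n=PREFIX_LEN):
    """Map each token to its first n characters (shorter tokens are kept whole)."""
    return [t[:n] for t in tokens]


def tf_idf(docs):
    """Return one {term: score} dict per document.

    Assumed textbook variant: tf is the relative frequency of the term in
    the document and idf = log(N / df). The paper may define it differently.
    """
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return scores


# Hypothetical toy collection with Portuguese-like inflected forms.
docs = [
    "os governos governam e o governo governa".split(),
    "o mercado e os mercados".split(),
]
prefix_docs = [to_prefixes(d) for d in docs]

word_scores = tf_idf(docs)           # word-based representation
prefix_scores = tf_idf(prefix_docs)  # prefix-based representation
```

In this toy collection the four inflected forms of the first document collapse into the single prefix `gover`, so their frequency mass is pooled into one term instead of being spread over four rare words. This is the kind of effect the prefix representation is meant to probe for morphologically rich languages.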

1 Introduction

A key term, keyword, or topic of a document is any word or multi-word (a sequence of two or more words expressing a clear-cut concept) that reveals important information about the content of a document from a larger collection of documents. Extraction of document key terms is far from being solved. However, it is an important problem that deserves further attention, since most documents are still (and will continue to be) produced without explicit indication of their key terms as metadata. Moreover, most existing key term extractors are language dependent and, as such, require linguistic processing tools that are not available for the majority of human languages. Hence our bet, in this paper in particular, is on language-independent extraction of key terms.

The main aim of this work is therefore to compare metrics for improving the automatic extraction of key single words and multi-words from document collections, and to contribute to the discussion on this subject.

Our starting point was the work by [1] on multi-word key term extraction, where two basic metrics were used: Tf-Idf and relative variance (Rvar). By looking more
