Language Independent Extraction of Key Terms:
An Extensive Comparison of Metrics
Luís F.S. Teixeira 2 , Gabriel P. Lopes 2 , and Rita A. Ribeiro 1
1 CA3-Uninova, Campus FCT/UNL, 2829-516 Caparica, Portugal
2 CITI, Dep. Informática, FCT/UNL, 2829-516 Caparica, Portugal
lst@luisteixeira.org, gpl@fct.unl.pt, rar@uninova.pt
Abstract. This paper compares twenty language-independent, statistically based
metrics for extracting key terms from any document collection. Some of those
metrics are widely used for this purpose; the others were created recently.
Two different document representations are considered in our
experiments. One is based on words and multi-words and the other is based on
word prefixes of fixed length (5 characters in the experiments reported here).
Prefixes were used to study how morphologically rich languages, namely
Portuguese and Czech, behave under this second kind of representation.
English is also studied, as a language that is not morphologically rich.
Results are manually evaluated and agreement between evaluators is assessed
using k-Statistics. The metrics based on Tf-Idf and Phi-square proved to have
higher precision and recall. The use of prefix-based representation of
documents enabled a significant precision improvement for documents written
in Portuguese. For Czech, recall also improved.
Keywords: Document keywords, Document topics, Words, Multi-words,
Prefixes, Automatic extraction, Suffix arrays.
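The prefix-based representation mentioned in the abstract can be illustrated with a minimal sketch. It assumes whitespace tokenization, lower-casing, and the common textbook form of Tf-Idf (relative term frequency times log(N/df)); none of these choices is confirmed by the text above, and the function names `to_prefixes` and `tf_idf` and the three-sentence Portuguese corpus are hypothetical, introduced only for illustration.

```python
import math
from collections import Counter

PREFIX_LEN = 5  # fixed prefix length reported in the abstract


def to_prefixes(text, n=PREFIX_LEN):
    """Lower-case, split on whitespace, and truncate each token to n characters.

    Assumption: this simple tokenization is for illustration only.
    """
    return [token[:n] for token in text.lower().split()]


def tf_idf(docs):
    """Return one {term: score} dict per document.

    Assumption: textbook Tf-Idf, i.e. relative frequency * log(N / df),
    which may differ from the variant used in the paper.
    """
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical toy corpus (Portuguese), not taken from the paper.
corpus = [
    "pescadores pescam peixe no mar",
    "o pescador vende peixe fresco",
    "o mercado abre cedo",
]
prefix_docs = [to_prefixes(doc) for doc in corpus]
scores = tf_idf(prefix_docs)
```

In this toy corpus the inflected forms "pescadores", "pescam" and "pescador" all collapse onto the prefix "pesca", so their counts are pooled under a single term. This is the kind of effect a prefix representation is meant to have for morphologically rich languages.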
1 Introduction
A key term, a keyword or a topic of a document is any word or multi-word (taken as a
sequence of two or more words expressing a clear-cut concept) that reveals important
information about the content of a document from a larger collection of documents.
Extraction of document key terms is far from being solved. However, this is an
important problem that deserves further attention, since most documents are still (and
will continue to be) produced without explicit indication of their key terms as
metadata. Moreover, most existing key term extractors are language dependent and,
as such, require linguistic processing tools that are not available for the majority of
human languages. Hence our bet, in this paper in particular, is on
language-independent extraction of key terms.
The main aim of this work is therefore to compare metrics for improving the automatic
extraction of key single words and multi-words from document collections, and to
contribute to the discussion on this subject.
Our starting point was the work of [1] on multi-word key term extraction, where
two basic metrics were used: Tf-Idf and relative variance (Rvar). By looking more
 