Regarding the extraction of multi-words and collocations, which our system also requires, we mention only the work in [5], which uses no linguistic knowledge, and the work in [6], which requires it.
In the area of keyword and key multi-word extraction, [7-11] address the extraction of keywords in English. Those authors rely on language-dependent tools (stop-word removal, lemmatization, part-of-speech tagging and syntactic pattern recognition) to extract noun phrases. Being language independent, our approach clearly diverges from theirs. Approaches dealing with the extraction of key-phrases (which are, according to the authors, "short phrases that indicate the main topic of a document") include the work of [12], where the Tf-Idf metric is used together with several language-dependent tools. In [13], a graph-based ranking model for text processing is used. The authors follow a two-phase approach to the extraction task: first they select key-phrases representative of a given text; then they extract the most "important" sentences to be used for summarizing the document content.
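The Tf-Idf metric mentioned above can be sketched briefly. The following is a minimal illustration, not the implementation used in [12]: each term in a document is scored by its term frequency multiplied by the logarithm of the inverse of its document frequency across the collection.

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """Score each term of each document (a list of tokens) by Tf-Idf:
    term frequency times inverse document frequency."""
    n_docs = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: (tf[t] / len(doc)) * math.log(n_docs / df[t])
                       for t in tf})
    return scores

# Hypothetical toy collection of three tokenized documents
docs = [["keyword", "extraction", "keyword"],
        ["graph", "ranking", "extraction"],
        ["keyword", "ranking", "graph"]]
scores = tfidf_scores(docs)
top = max(scores[0], key=scores[0].get)  # highest-scored term of doc 0
```

Terms frequent in one document but rare in the collection receive the highest scores, which is why the metric is popular for keyword selection.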
In [14] the author tackles the problem of automatically extracting key-phrases from text as a supervised learning task. He treats a document as a set of phrases, which his classifier learns to identify as positive or negative examples of key-phrases.
[15] deal with eight different languages, using statistical metrics aided by linguistic processing to extract both key phrases and keywords. Also dealing with more than one language, [1] extract key multi-words using purely statistical measures. In [2] statistical extraction of keywords is also tackled, but a predefined ratio of keywords to key multi-words is imposed per document, thus jeopardizing statistical purity. [16] present a keyword extraction algorithm that applies to isolated documents rather than to documents in a collection. They extract frequent terms and a set of co-occurrences between each term and the frequent terms.
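The single-document strategy of [16] can be illustrated with a small sketch (the term names and the choice of sentence-level co-occurrence windows below are assumptions for illustration, not details taken from [16]): select the most frequent terms of the document, then count how often every other term co-occurs with them in the same sentence.

```python
from collections import Counter

def cooccurrence_counts(sentences, top_k=1):
    """For a single tokenized document (list of sentences), find the
    top_k most frequent terms and count, for every term, how often it
    co-occurs with each frequent term inside the same sentence."""
    freq = Counter(t for s in sentences for t in s)
    frequent = {t for t, _ in freq.most_common(top_k)}
    co = Counter()
    for s in sentences:
        terms = set(s)
        for t in terms:
            for f in frequent & terms:
                if t != f:
                    co[(t, f)] += 1
    return frequent, co

# Hypothetical toy document: three tokenized sentences
sentences = [["suffix", "array", "index"],
             ["suffix", "tree", "index"],
             ["array", "index"]]
frequent, co = cooccurrence_counts(sentences, top_k=1)
```

Terms whose co-occurrence distribution with the frequent terms is strongly biased can then be promoted as keywords, without needing any reference collection.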
In summary, the approach followed in our work is unsupervised and language independent, and it extracts keywords and multi-words solely on the basis of their ranking values, obtained by applying the 20 metrics announced and explained below in Section 4.
3   System Data and Experiments
Our system consists of three distinct modules. The first module is responsible for extracting multi-words, based on [17] and using the extractor of [18]. A Suffix Array [19,20] is used for frequency counting of words, multi-words and prefixes. This module is also responsible for ranking words and multi-words per document, according to the metrics used, and it allows the back-office user to define the minimum word and prefix lengths to be considered. In the experiments reported here, we fixed the minimum word length at 6 characters and the minimum prefix length at 5 characters.
The second module is a user interface designed to allow external evaluators to classify the best 25 terms ranked according to each of the selected metrics. When moving from the ranking produced by one metric to the ranking produced by another, evaluations already made are automatically propagated. This feature enables evaluators to reconsider, at any time, some of their earlier decisions.