Regarding the extraction of multi-words and collocations, which our system also requires, we mention only the work in [5], which uses no linguistic knowledge, and the work in [6], which requires it.
In the area of keyword and key multi-word extraction, [7-11] address the extraction of keywords in English. Those authors rely on language-dependent tools (stop-word removal, lemmatization, part-of-speech tagging and syntactic pattern recognition) to extract noun phrases. Being language independent, our approach clearly diverges from theirs. Approaches dealing with the extraction of key-phrases (which are, according to the authors, "short phrases that indicate the main topic of a document") include the work of [12], where the Tf-Idf metric is used together with several language-dependent tools. In [13], a graph-based ranking model for text processing is used. The authors follow a two-phase approach to the extraction task: first they select key-phrases representative of a given text; then they extract the most "important" sentences to be used for summarizing the document content.
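The Tf-Idf metric mentioned above can be sketched briefly. The following is a minimal illustration, not the implementation used in [12]: each term in a document is scored by its term frequency multiplied by the logarithm of the inverse of its document frequency across the collection.

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """Score each term of each document (a list of tokens) by Tf-Idf:
    term frequency times inverse document frequency."""
    n_docs = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: (tf[t] / len(doc)) * math.log(n_docs / df[t])
                       for t in tf})
    return scores

# Hypothetical toy collection of three tokenized documents
docs = [["keyword", "extraction", "keyword"],
        ["graph", "ranking", "extraction"],
        ["keyword", "ranking", "graph"]]
scores = tfidf_scores(docs)
top = max(scores[0], key=scores[0].get)  # highest-scored term of doc 0
```

Terms frequent in one document but rare in the collection receive the highest scores, which is why the metric is popular for keyword selection.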
In [14] the author tackles the problem of automatically extracting key-phrases from text as a supervised learning task. He treats a document as a set of phrases, which his classifier learns to identify as positive or negative examples of key-phrases.
[15] deal with eight different languages, using statistical metrics aided by linguistic processing to extract both key phrases and keywords. Also dealing with more than one language, [1] extract key multi-words using purely statistical measures. In [2] statistical extraction of keywords is also tackled, but a predefined ratio of keywords to key multi-words is imposed per document, thus jeopardizing statistical purity. [16] present a keyword extraction algorithm that applies to isolated documents rather than to documents in a collection. They extract frequent terms and a set of co-occurrences between each term and the frequent terms.
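The single-document strategy of [16] can be illustrated with a small sketch (the term names and the choice of sentence-level co-occurrence windows below are assumptions for illustration, not details taken from [16]): select the most frequent terms of the document, then count how often every other term co-occurs with them in the same sentence.

```python
from collections import Counter

def cooccurrence_counts(sentences, top_k=1):
    """For a single tokenized document (list of sentences), find the
    top_k most frequent terms and count, for every term, how often it
    co-occurs with each frequent term inside the same sentence."""
    freq = Counter(t for s in sentences for t in s)
    frequent = {t for t, _ in freq.most_common(top_k)}
    co = Counter()
    for s in sentences:
        terms = set(s)
        for t in terms:
            for f in frequent & terms:
                if t != f:
                    co[(t, f)] += 1
    return frequent, co

# Hypothetical toy document: three tokenized sentences
sentences = [["suffix", "array", "index"],
             ["suffix", "tree", "index"],
             ["array", "index"]]
frequent, co = cooccurrence_counts(sentences, top_k=1)
```

Terms whose co-occurrence distribution with the frequent terms is strongly biased can then be promoted as keywords, without needing any reference collection.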
In summary, the approach followed in our work is unsupervised and language independent, and it extracts keywords and multi-words solely on the basis of their ranking values, obtained by applying the 20 metrics announced and explained below in Section 4.
3   System Data and Experiments
Our system consists of three distinct modules. The first module is responsible for extracting multi-words, based on [17] and using the extractor of [18]. A Suffix Array [19,20] is used for frequency counting of words, multi-words and prefixes. This module is also responsible for ranking words and multi-words per document, according to the metrics used, and it allows the back-office user to define the minimum word and prefix lengths to be considered. In the experiments reported here, we fixed the minimum word length at 6 characters and the minimum prefix length at 5 characters.
The second module is a user interface designed to allow external evaluators to classify the best 25 terms ranked according to each of the selected metrics. When moving from the ranking produced by one metric to the ranking produced by another, evaluations already made are automatically propagated. This feature enables evaluators to reconsider, at any time, some of their earlier decisions.