carefully at the examples shown in [1], where the plain Tf-Idf metric is used, it became
apparent that the comparison made between the two metrics was unfair. A fair
comparison would require a Tf-Idf derived metric that takes into account the Tf-Idf
values of the multi-word extremities as well as the median character length per word
of each multi-word, as was proposed for the Rvar variant metric in [1].
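For orientation, a minimal sketch of a plain Tf-Idf score is given below. It uses a common textbook form (relative term frequency times the logarithm of the inverse document frequency), which is an assumption and may differ in detail from the definition given in section 4. The function name `tf_idf` and the toy documents are illustrative only; each document is assumed to be already split into terms, which may be words or multi-words.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: score} mapping per document.

    Assumed form: (count of term in doc / number of terms in doc)
                  * log(number of docs / number of docs containing term).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq: Counter = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    scores: List[Dict[str, float]] = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Toy collection (hypothetical data): terms may be words or multi-words.
docs = [
    ["energy", "renewable energy", "energy"],
    ["renewable energy", "policy"],
]
scores = tf_idf(docs)
```

A term that occurs in every document, such as "renewable energy" in this toy collection, receives a score of zero, which is the behaviour the derived metrics are meant to refine.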
Moreover, since the relevance of words and multi-words must be calculated using the
same metrics, we decided to rank words and multi-words together, ordering the terms
that describe the content of any document in a collection by the value the metric
assigns to each word or multi-word. This diverges from the proposal made in [2], where
a fixed proportion of words and multi-words is required “a priori”. However, no one
knows “a priori” which documents are better described by words alone or by
multi-words, nor the best proportion of key words to key multi-words.
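The joint ranking described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `rank_terms`, the cut-off `top_n` and the scores are hypothetical, and the scores stand in for whatever value a given metric assigns to each term.

```python
from typing import Dict, List, Tuple


def rank_terms(scores: Dict[str, float], top_n: int) -> List[Tuple[str, float]]:
    """Rank words and multi-words in a single list by metric value.

    No quota is reserved for either kind of term: the proportion of
    words to multi-words in the result follows from the scores alone.
    """
    # Descending score; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Hypothetical metric values for one document.
doc_scores = {
    "energy": 0.42,
    "renewable energy": 0.77,
    "energy policy of the union": 0.65,
    "policy": 0.30,
    "union": 0.12,
}
top = rank_terms(doc_scores, top_n=3)
top_terms = [term for term, _ in top]
```

With these made-up values the top three terms contain two multi-words and one word; a different document could yield any other mix.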
In this way, our work advances the discussion started in [1] and continued in [2], but
we arrive at different conclusions, namely that Tf-Idf and Phi-square based metrics
enabled higher precision and recall for the extraction of document key terms. With
regard to precision, the use of a prefix-based representation of documents enabled a
significant improvement for documents written in Portuguese and a minor improvement
for Czech, both representatives of morphologically rich languages.
Additionally, we extend the preliminary discussion started in [3], where some
of the metrics used in the current work were presented. To achieve our aims, we compare
results obtained using four basic metrics (Tf-Idf, Phi-square, Mutual Information
and Relative Variance) and derived metrics that take into account the median
character length per word of words and multi-words and give specific attention to the
extremities of multi-words and of words (where the left and right extremities of a word
are considered to be identical to the word itself). This led to a first experiment
in which we compare 12 metrics (3 variants of each of the 4 metrics). In a second
experiment, we used a different document representation, based on word prefixes of 5
characters, in order to tackle morphologically rich languages. As it would be senseless
to evaluate the relevance of prefixes themselves, it became necessary to project
(bubble) prefix relevance onto words and multi-words.
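The prefix-based representation and the projection step can be sketched as below. The prefix length of 5 comes from the text; everything else is an assumption made for illustration: the helper names `to_prefix` and `bubble_scores`, the rule that a multi-word is represented by the sequence of its word prefixes, the rule that each original term simply inherits the score of its prefix form, and the scores themselves.

```python
from typing import Dict, List

PREFIX_LEN = 5  # prefix length stated in the text


def to_prefix(term: str, n: int = PREFIX_LEN) -> str:
    """Truncate every word of a word or multi-word to its first n characters."""
    return " ".join(word[:n] for word in term.split())


def bubble_scores(prefix_scores: Dict[str, float],
                  terms: List[str]) -> Dict[str, float]:
    """Project prefix relevance back onto the original terms.

    Assumed rule: a term inherits the score of its prefix form, and
    terms whose prefix form was never scored get 0.0.
    """
    return {term: prefix_scores.get(to_prefix(term), 0.0) for term in terms}


# Hypothetical scores computed on the prefix representation.
prefix_scores = {"minis": 0.8, "gover regio": 0.6}

# Inflected Portuguese forms that share the same prefix form.
terms = ["ministro", "ministros", "governo regional", "governos regionais"]
bubbled = bubble_scores(prefix_scores, terms)
```

In this sketch the inflected forms "ministro" and "ministros" collapse to the same prefix "minis" and therefore receive the same relevance, which is the intended effect for morphologically rich languages.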
All the experimental results were manually evaluated and agreement between
evaluators was assessed using k-Statistics.
This paper is structured as follows: related work is summarized in section 2; our
system, the data and the experimental procedures used are described in section 3; the
metrics used are defined in section 4; results obtained are shown in section 5;
conclusions and future work are discussed in section 6.
2 Related Work
In the area of document classification, it is necessary to select features that will later
be used for training new classifiers and for classifying new documents. This feature
selection task is somewhat related to the extraction of key terms addressed in this
paper. A rather complete overview of the main metrics used for feature
selection in document classification and clustering is given in [4].