term t in document $d_j$, t denotes a prefix, a word, or a multiword, and $N_{d_j}$ refers to the number of words or n-grams of words contained in $d_j$. The total number of documents present in the corpus is given by $\|D\|$. The use of a probability in (1) normalizes the Tf-Idf metric, making it independent of the size of the document under consideration.
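For concreteness, the following Python sketch computes this probability-normalized Tf-Idf over a toy corpus. It assumes that equations (1) and (2), which appear earlier in the text, take the usual form: p(t,d) is the relative frequency of t in d, and Tf-Idf(t,d) = p(t,d) · log(‖D‖ / number of documents containing t). Function and variable names are illustrative, not taken from [1,2].

```python
from collections import Counter
from math import log

def prob(term, doc_tokens):
    """p(t, d): relative frequency of term t in document d (assumed form of eq. (2))."""
    return Counter(doc_tokens)[term] / len(doc_tokens)

def tf_idf(term, doc_tokens, corpus):
    """Probability-normalized Tf-Idf (assumed form of eq. (1)):
    p(t, d) * log(||D|| / number of documents containing t)."""
    n_docs = len(corpus)
    doc_freq = sum(1 for d in corpus if term in d)
    if doc_freq == 0:
        return 0.0
    return prob(term, doc_tokens) * log(n_docs / doc_freq)

# Toy corpus: each document is a list of tokens (words or word n-grams).
corpus = [["data", "mining", "text", "mining"],
          ["text", "classification"],
          ["data", "bases"]]
print(tf_idf("mining", corpus[0], corpus))   # frequent in d1, absent elsewhere
```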
Rvar (Relative Variance) is the metric proposed by [1], defined in equation (4). It
does not take into account the occurrence of a given term t in a specific document in
the corpus. It deals with the whole corpus, and thus loses the locality characteristics of
term t. This locality is recovered when the best ranked terms are reassigned to the
documents where they occur.
$$Rvar(t) = \frac{1}{\|D\|} \sum_{d \in D} \left( \frac{p(t,d) - \bar{p}(t,\cdot)}{\bar{p}(t,\cdot)} \right)^{2} \qquad (4)$$
$p(t,d)$ is defined in (2) and $\bar{p}(t,\cdot)$ denotes the mean probability of term t, taking into account all documents in the collection. As above, we take t as denoting a prefix, a word, or a multiword.
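A corresponding sketch for Rvar, reusing the prob(term, doc_tokens) helper from the sketch above and following the reconstructed form of equation (4):

```python
from statistics import mean

def rvar(term, corpus):
    """Relative variance of p(t, d) over the whole corpus (reconstructed eq. (4)):
    Rvar(t) = (1/||D||) * sum_d ((p(t, d) - p_mean) / p_mean) ** 2."""
    probs = [prob(term, doc) for doc in corpus]   # p(t, d) for every document d
    p_mean = mean(probs)                          # mean probability of t in the collection
    if p_mean == 0:
        return 0.0                                # term absent from the corpus
    return sum(((p - p_mean) / p_mean) ** 2 for p in probs) / len(corpus)
```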
Phi-Square [21] is a variant of the well-known Chi-Square metric. It allows a normalization of the results obtained with Chi-Square, and is defined in equation (5), where $N$ is the total number of terms (prefixes, words, or multi-words) present in the corpus (the sum of terms from all documents in the collection). $A$ denotes the number of times term t occurs in document d. $B$ stands for the number of times term t occurs in documents other than d, in the collection. $C$ is the number of terms of document d minus the number of times term t occurs in document d. Finally, $D$ is the number of terms that neither belong to document d nor are occurrences of term t (i.e., $N - A - B - C$, with $N$ as defined above).
$$\phi^{2}(t,d) = \frac{(A \cdot D - C \cdot B)^{2}}{(A+C)\,(B+D)\,(A+B)\,(C+D)} \qquad (5)$$
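Under the reconstructed form of equation (5), the four contingency counts and the Phi-Square score can be computed as in the following illustrative sketch, with documents again represented as token lists:

```python
def contingency(term, doc, corpus):
    """A, B, C, D counts for term t and document d, with N the total number of
    terms in the corpus (as defined for eq. (5))."""
    A = doc.count(term)                                     # t inside d
    B = sum(d.count(term) for d in corpus if d is not doc)  # t outside d
    C = len(doc) - A                                        # terms of d other than t
    N = sum(len(d) for d in corpus)                         # all terms in the corpus
    D = N - A - B - C                                       # neither in d nor equal to t
    return A, B, C, D, N

def phi_square(term, doc, corpus):
    """Phi-Square (reconstructed eq. (5))."""
    A, B, C, D, _ = contingency(term, doc, corpus)
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return 0.0 if denom == 0 else (A * D - C * B) ** 2 / denom
```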
Mutual Information [22] is a widely used metric for identifying associations
between randomly selected terms. For our purposes we use equation (6), where t, d, A, B, C and N have the same meanings as in equation (5).
$$MI(t,d) = \log \frac{A \cdot N}{(A+C)\,(A+B)} \qquad (6)$$
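A matching sketch for equation (6), reusing the contingency(term, doc, corpus) helper from the Phi-Square example above:

```python
from math import log

def mutual_information(term, doc, corpus):
    """Mutual Information (reconstructed eq. (6)):
    MI(t, d) = log(A * N / ((A + C) * (A + B)))."""
    A, B, C, _, N = contingency(term, doc, corpus)
    if A == 0:
        return float("-inf")   # t never occurs in d
    return log(A * N / ((A + C) * (A + B)))
```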
In the rest of this section we will introduce derivations of the metrics presented above for dealing, on equivalent grounds, with aspects that were considered crucial in [1,2] for extracting key terms. Those derivations will be defined on the basis of 3 operators: Least (L), Median (M) and Bubble (B). In the equations below, $MT$ stands for any of the previously presented metrics (Tf-Idf, Rvar, Phi-Square or $\phi^2$, and Mutual Information or MI), $P$ stands for a Prefix, $W$ for a word, and $MW$ for a multi-word taken as a word sequence ($w_1 \ldots w_n$).
Least Operator is inspired by the metric LeastRvar introduced in [1] and
coincides with that metric if it is applied to Rvar.
$$L_{MT}(MW) = \min\big(MT(w_1),\, MT(w_n)\big) \qquad (7)$$
$L_{MT}(MW)$ is determined as the minimum of $MT$ applied to the leftmost and rightmost words of $MW$, $w_1$ and $w_n$. In order to treat all metrics on equal grounds, operator “Least” will now be applied to metric $MT$, where $MT$ may be any of the previously presented metrics.
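As an illustration, the Least operator can be written as a higher-order function that takes any of the metric sketches above as the base metric MT. The shared signature metric(term, doc, corpus) is an assumption of these sketches, not a definition from [1]:

```python
def least(metric, multiword, doc, corpus):
    """Least operator L_MT (eq. (7)): minimum of the base metric applied to the
    leftmost and rightmost words of the multi-word w1 ... wn."""
    w1, wn = multiword[0], multiword[-1]
    return min(metric(w1, doc, corpus), metric(wn, doc, corpus))

# e.g., with the toy corpus from the Tf-Idf sketch:
# least(phi_square, ["text", "mining", "tools"], corpus[0], corpus)
```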