term t in document $d_j$, t denotes a prefix, a word, or a multiword, and $N_{d_j}$ refers to the number of words or n-grams of words contained in $d_j$. The total number of documents present in the corpus is given by $\|D\|$. The use of a probability in (1) normalizes the Tf-Idf metric, making it independent of the size of the document under consideration.
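For concreteness, the following Python sketch computes this probability-normalized Tf-Idf over a toy corpus. It assumes that equations (1) and (2), which appear earlier in the text, take the usual form: p(t,d) is the relative frequency of t in d, and Tf-Idf(t,d) = p(t,d) · log(‖D‖ / number of documents containing t). Function and variable names are illustrative, not taken from [1,2].

```python
from collections import Counter
from math import log

def prob(term, doc_tokens):
    """p(t, d): relative frequency of term t in document d (assumed form of eq. (2))."""
    return Counter(doc_tokens)[term] / len(doc_tokens)

def tf_idf(term, doc_tokens, corpus):
    """Probability-normalized Tf-Idf (assumed form of eq. (1)):
    p(t, d) * log(||D|| / number of documents containing t)."""
    n_docs = len(corpus)
    doc_freq = sum(1 for d in corpus if term in d)
    if doc_freq == 0:
        return 0.0
    return prob(term, doc_tokens) * log(n_docs / doc_freq)

# Toy corpus: each document is a list of tokens (words or word n-grams).
corpus = [["data", "mining", "text", "mining"],
          ["text", "classification"],
          ["data", "bases"]]
print(tf_idf("mining", corpus[0], corpus))   # frequent in d1, absent elsewhere
```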
Rvar (Relative Variance) is the metric proposed by [1], defined in equation (4). It
does not take into account the occurrence of a given term t in a specific document in
the corpus. It deals with the whole corpus, and thus loses the locality characteristics of
term t. This locality is recovered when the best ranked terms are reassigned to the
documents where they occur.
$$Rvar(t) = \frac{1}{\|D\|} \sum_{d \in D} \left( \frac{p(t,d) - \bar{p}(t,\cdot)}{\bar{p}(t,\cdot)} \right)^{2} \qquad (4)$$
$p(t,d)$ is defined in (2) and $\bar{p}(t,\cdot)$ denotes the mean probability of term t, taking into account all documents in the collection. As above, we take t as denoting a prefix, a word, or a multiword.
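A corresponding sketch for Rvar, reusing the prob(term, doc_tokens) helper from the sketch above and following the reconstructed form of equation (4):

```python
from statistics import mean

def rvar(term, corpus):
    """Relative variance of p(t, d) over the whole corpus (reconstructed eq. (4)):
    Rvar(t) = (1/||D||) * sum_d ((p(t, d) - p_mean) / p_mean) ** 2."""
    probs = [prob(term, doc) for doc in corpus]   # p(t, d) for every document d
    p_mean = mean(probs)                          # mean probability of t in the collection
    if p_mean == 0:
        return 0.0                                # term absent from the corpus
    return sum(((p - p_mean) / p_mean) ** 2 for p in probs) / len(corpus)
```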
Phi-Square [21] is a variant of the well-known Chi-Square metric. It allows a normalization of the results obtained with Chi-Square, and is defined in equation (5), where $N$ is the total number of terms (prefixes, words, or multi-words) present in the corpus (the sum of terms from all documents in the collection). $A$ denotes the number of times term t occurs in document d. $B$ stands for the number of times term t occurs in documents other than d, in the collection. $C$ is the number of terms of document d minus the number of times term t occurs in document d. Finally, $D$ is the number of terms that neither belong to document d nor are occurrences of term t (i.e., $N - A - B - C$, with $N$ as defined above).
$$\phi^{2}(t,d) = \frac{(A \cdot D - C \cdot B)^{2}}{(A+C)\,(B+D)\,(A+B)\,(C+D)} \qquad (5)$$
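Under the reconstructed form of equation (5), the four contingency counts and the Phi-Square score can be computed as in the following illustrative sketch, with documents again represented as token lists:

```python
def contingency(term, doc, corpus):
    """A, B, C, D counts for term t and document d, with N the total number of
    terms in the corpus (as defined for eq. (5))."""
    A = doc.count(term)                                     # t inside d
    B = sum(d.count(term) for d in corpus if d is not doc)  # t outside d
    C = len(doc) - A                                        # terms of d other than t
    N = sum(len(d) for d in corpus)                         # all terms in the corpus
    D = N - A - B - C                                       # neither in d nor equal to t
    return A, B, C, D, N

def phi_square(term, doc, corpus):
    """Phi-Square (reconstructed eq. (5))."""
    A, B, C, D, _ = contingency(term, doc, corpus)
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return 0.0 if denom == 0 else (A * D - C * B) ** 2 / denom
```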
Mutual Information [22] is a widely used metric for identifying associations
between randomly selected terms. For our purposes we use equation (6), where t, d, A, B, C and N have the same meanings as in equation (5).
$$MI(t,d) = \log \frac{A \cdot N}{(A+C)\,(A+B)} \qquad (6)$$
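A matching sketch for equation (6), reusing the contingency(term, doc, corpus) helper from the Phi-Square example above:

```python
from math import log

def mutual_information(term, doc, corpus):
    """Mutual Information (reconstructed eq. (6)):
    MI(t, d) = log(A * N / ((A + C) * (A + B)))."""
    A, B, C, _, N = contingency(term, doc, corpus)
    if A == 0:
        return float("-inf")   # t never occurs in d
    return log(A * N / ((A + C) * (A + B)))
```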
In the rest of this section we will introduce derivations of the metrics presented above for dealing, on equivalent grounds, with aspects that were considered crucial in [1,2] for extracting key terms. Those derivations will be defined on the basis of 3 operators: Least (L), Median (M) and Bubble (B). In the equations below, $MT$ stands for any of the previously presented metrics (Tf-Idf, Rvar, Phi-Square or $\phi^2$, and Mutual Information or MI), $P$ stands for a Prefix, $W$ for a word, and $MW$ for a multi-word taken as a word sequence ($w_1 \ldots w_n$).
Least Operator is inspired by the metric LeastRvar introduced in [1] and
coincides with that metric if it is applied to Rvar.
$$L_{MT}(MW) = \min\big(MT(w_1),\, MT(w_n)\big) \qquad (7)$$
$L_{MT}(MW)$ is determined as the minimum of $MT$ applied to the leftmost and rightmost words of $MW$, $w_1$ and $w_n$. In order to treat all metrics on equal grounds, operator “Least” will now be applied to metric $MT$, where $MT$ may be any of the previously presented metrics.
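As an illustration, the Least operator can be written as a higher-order function that takes any of the metric sketches above as the base metric MT. The shared signature metric(term, doc, corpus) is an assumption of these sketches, not a definition from [1]:

```python
def least(metric, multiword, doc, corpus):
    """Least operator L_MT (eq. (7)): minimum of the base metric applied to the
    leftmost and rightmost words of the multi-word w1 ... wn."""
    w1, wn = multiword[0], multiword[-1]
    return min(metric(w1, doc, corpus), metric(wn, doc, corpus))

# e.g., with the toy corpus from the Tf-Idf sketch:
# least(phi_square, ["text", "mining", "tools"], corpus[0], corpus)
```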