The main technique for discovering user interest is described in Section 5. Section 6 presents the evaluation.
1 Vector Model for Representing Documents
Suppose our corpus $D$ is a collection of documents $D_i$: $D = \{D_1, D_2, \dots, D_m\}$. Every document $D_i$ contains a set of keywords, so-called terms. The number of times a term occurs in a document is called its term frequency. Given the document $D_i$ and term $t_j$, the term frequency $tf_{ij}$, measuring the importance of term $t_j$ within document $D_i$, is defined as:
\[ tf_{ij} = \frac{n_{ij}}{\sum_k n_{ik}} \]
where $n_{ij}$ is the number of occurrences of term $t_j$ in document $D_i$, and the denominator is the sum of the occurrences of all terms in document $D_i$.
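The term-frequency formula above can be sketched in Python. This is a minimal illustration, assuming each document has already been tokenized into a list of terms; the function name and sample terms are only for demonstration.

```python
from collections import Counter

def term_frequency(doc_terms):
    # tf_ij = n_ij / sum_k n_ik: each term's count divided by the
    # total number of term occurrences in the document.
    counts = Counter(doc_terms)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

tf = term_frequency(["apple", "banana", "apple", "cherry"])
# "apple" occurs 2 times out of 4 term occurrences, so tf["apple"] == 0.5
```

By construction the frequencies of all terms in one document sum to 1.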
Suppose we need to find the documents most relevant to a query containing term $t_j$. The simplest approach is to choose the documents with the highest term frequency $tf_{ij}$. However, $t_j$ may be a poor term for distinguishing relevant from non-relevant documents, while other, rarely occurring terms may discriminate better. Ranking by term frequency alone then incorrectly emphasizes documents containing $t_j$ without giving enough weight to other meaningful terms. The inverse document frequency is therefore introduced as a measure of the general importance of a term: it decreases the weight of terms that occur frequently and increases the weight of terms that occur rarely. The inverse document frequency of term $t_j$ is based on the ratio of the size of the corpus to the number of documents in which $t_j$ occurs:
\[ idf_j = \log \frac{|corpus|}{|\{D : t_j \in D\}|} \]
where $|corpus|$ is the total number of documents in the corpus and $|\{D : t_j \in D\}|$ is the number of documents containing term $t_j$. The log function dampens the scale of $idf_j$, so that very rare terms do not dominate the weight.
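The inverse document frequency can likewise be sketched in Python. This is an illustrative implementation, assuming the corpus is a list of tokenized documents; the names are hypothetical.

```python
import math

def inverse_document_frequency(corpus):
    # idf_j = log(|corpus| / |{D : t_j in D}|), computed for every term.
    m = len(corpus)
    df = {}  # document frequency: in how many documents each term occurs
    for doc_terms in corpus:
        for term in set(doc_terms):  # count each term at most once per document
            df[term] = df.get(term, 0) + 1
    return {term: math.log(m / n) for term, n in df.items()}

corpus = [["apple", "banana"], ["apple", "cherry"], ["banana", "durian"]]
idf = inverse_document_frequency(corpus)
# "apple" appears in 2 of 3 documents, so idf["apple"] = log(3/2)
```

Note that a term appearing in every document gets $idf = \log 1 = 0$, which reflects that it has no discriminating power.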
The weight of term $t_j$ in document $D_i$ is defined as the product of $tf_{ij}$ and $idf_j$:
\[ w_{ij} = tf_{ij} \cdot idf_j \]
This weight measures the importance of a term in a document over the corpus. It increases proportionally to the number of times the term occurs in the document, but is offset by the frequency of the term across the corpus. In general, this weight balances the two measures: term frequency and inverse document frequency.
Suppose there are $n$ terms $\{t_1, t_2, \dots, t_n\}$; each document $D_i$ is then modeled as the vector composed of the weights of these terms:
\[ D_i = (w_{i1}, w_{i2}, w_{i3}, \dots, w_{in}) \]
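Putting the pieces together, the full document-vector construction can be sketched as follows. This is a minimal Python sketch, assuming tokenized documents; the vocabulary ordering (sorted) and function name are choices made here for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(corpus):
    # Model each document D_i as the vector (w_i1, ..., w_in)
    # with w_ij = tf_ij * idf_j, over a fixed vocabulary of n terms.
    m = len(corpus)
    vocab = sorted({t for doc in corpus for t in doc})  # the terms t_1 .. t_n
    idf = {t: math.log(m / sum(1 for doc in corpus if t in doc)) for t in vocab}
    vectors = []
    for doc in corpus:
        counts = Counter(doc)
        total = len(doc)
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors

corpus = [["apple", "banana", "apple"], ["banana", "cherry"]]
vocab, vectors = tfidf_vectors(corpus)
# vocab == ["apple", "banana", "cherry"]; "banana" occurs in both
# documents, so its idf is log(2/2) = 0 and its weight is 0 everywhere.
```

Each resulting vector has one component per vocabulary term, so documents of different lengths become directly comparable, e.g. by cosine similarity.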
 