The main technique for discovering user interest is described in Section 5. Section 6 presents the evaluation.
1 Vector Model for Representing Documents
Suppose our corpus $D$ is a collection of documents $D_i$: $D = \{D_1, D_2, \dots, D_m\}$. Every document $D_i$ contains a set of keywords, so-called terms. The number of times a term occurs in a document is called its term frequency. Given the document $D_i$ and term $t_j$, the term frequency $tf_{ij}$, measuring the importance of term $t_j$ within document $D_i$, is defined as:
\[ tf_{ij} = \frac{n_{ij}}{\sum_k n_{ik}} \]
where $n_{ij}$ is the number of occurrences of term $t_j$ in document $D_i$, and the denominator is the sum of the occurrences of all terms in document $D_i$.
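The term-frequency formula above can be sketched in Python. This is a minimal illustration, assuming each document has already been tokenized into a list of terms; the function name and sample terms are only for demonstration.

```python
from collections import Counter

def term_frequency(doc_terms):
    # tf_ij = n_ij / sum_k n_ik: each term's count divided by the
    # total number of term occurrences in the document.
    counts = Counter(doc_terms)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

tf = term_frequency(["apple", "banana", "apple", "cherry"])
# "apple" occurs 2 times out of 4 term occurrences, so tf["apple"] == 0.5
```

By construction the frequencies of all terms in one document sum to 1.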
Suppose we need to find the documents most relevant to a query containing term $t_j$. The simplest approach is to choose the documents with the highest term frequency $tf_{ij}$. However, $t_j$ may be a poor term for distinguishing relevant from non-relevant documents, while other, rarely occurring terms may discriminate better. Ranking by term frequency alone then incorrectly emphasizes documents containing $t_j$ without giving enough weight to other meaningful terms. The inverse document frequency is therefore introduced as a measure of the general importance of a term: it decreases the weight of terms that occur frequently and increases the weight of terms that occur rarely. The inverse document frequency of term $t_j$ is based on the ratio of the size of the corpus to the number of documents in which $t_j$ occurs:
\[ idf_j = \log \frac{|corpus|}{|\{D : t_j \in D\}|} \]
where $|corpus|$ is the total number of documents in the corpus and $|\{D : t_j \in D\}|$ is the number of documents containing term $t_j$. The log function dampens the scale of $idf_j$, so that very rare terms do not dominate the weight.
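The inverse document frequency can likewise be sketched in Python. This is an illustrative implementation, assuming the corpus is a list of tokenized documents; the names are hypothetical.

```python
import math

def inverse_document_frequency(corpus):
    # idf_j = log(|corpus| / |{D : t_j in D}|), computed for every term.
    m = len(corpus)
    df = {}  # document frequency: in how many documents each term occurs
    for doc_terms in corpus:
        for term in set(doc_terms):  # count each term at most once per document
            df[term] = df.get(term, 0) + 1
    return {term: math.log(m / n) for term, n in df.items()}

corpus = [["apple", "banana"], ["apple", "cherry"], ["banana", "durian"]]
idf = inverse_document_frequency(corpus)
# "apple" appears in 2 of 3 documents, so idf["apple"] = log(3/2)
```

Note that a term appearing in every document gets $idf = \log 1 = 0$, which reflects that it has no discriminating power.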
The weight of term $t_j$ in document $D_i$ is defined as the product of $tf_{ij}$ and $idf_j$:
\[ w_{ij} = tf_{ij} \cdot idf_j \]
This weight measures the importance of a term in a document over the corpus. It increases proportionally to the number of times the term occurs in the document, but is offset by the frequency of the term across the corpus. In general, this weight balances the two measures: term frequency and inverse document frequency.
Suppose there are $n$ terms $\{t_1, t_2, \dots, t_n\}$; each document $D_i$ is then modeled as the vector composed of the weights of these terms:
\[ D_i = (w_{i1}, w_{i2}, w_{i3}, \dots, w_{in}) \]
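Putting the pieces together, the full document-vector construction can be sketched as follows. This is a minimal Python sketch, assuming tokenized documents; the vocabulary ordering (sorted) and function name are choices made here for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(corpus):
    # Model each document D_i as the vector (w_i1, ..., w_in)
    # with w_ij = tf_ij * idf_j, over a fixed vocabulary of n terms.
    m = len(corpus)
    vocab = sorted({t for doc in corpus for t in doc})  # the terms t_1 .. t_n
    idf = {t: math.log(m / sum(1 for doc in corpus if t in doc)) for t in vocab}
    vectors = []
    for doc in corpus:
        counts = Counter(doc)
        total = len(doc)
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors

corpus = [["apple", "banana", "apple"], ["banana", "cherry"]]
vocab, vectors = tfidf_vectors(corpus)
# vocab == ["apple", "banana", "cherry"]; "banana" occurs in both
# documents, so its idf is log(2/2) = 0 and its weight is 0 everywhere.
```

Each resulting vector has one component per vocabulary term, so documents of different lengths become directly comparable, e.g. by cosine similarity.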
 