Discovering Proximal Social Intelligence for Quality Decision Support - Mining and Analyzing Social Networks


explaining the statistical significance of the keywords [17]. Term Frequency (TF),
proposed by Salton & McGill in 1983 for data indexing, is combined with IDF to
form a weighting algorithm for keywords. The reason for using this algorithm is
that the keywords used vary from document to document [16]; by combining TF and
IDF, it becomes possible to derive the relative weight of a keyword across all
documents.

TF-IDF is mainly used to find the relative weight of a keyword in a document.
TF measures how frequently the keyword appears, and IDF measures the relative
importance of the keyword.

$$\mathrm{IDF}_i = \log \frac{D}{\left|\{d_j : t_i \in d_j\}\right|}$$

where $D$ is the total number of documents and $\left|\{d_j : t_i \in d_j\}\right|$
is the number of documents that contain the keyword $i$.

$$\mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

denotes the frequency count of the appearance of a keyword in a document
divided by the sum of all keywords' appearance frequencies.
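The TF and IDF definitions above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names and the sample tag lists are hypothetical, the natural logarithm is assumed because the text does not state a base, and the two quantities are combined as the usual product TF × IDF, which the text implies but does not spell out.

```python
import math
from collections import Counter


def term_frequency(doc_tokens):
    """TF_{i,j} = n_{i,j} / sum_k n_{k,j} for every keyword in one document."""
    counts = Counter(doc_tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def inverse_document_frequency(docs):
    """IDF_i = log(D / number of documents that contain keyword i)."""
    num_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        # Count each keyword at most once per document.
        doc_freq.update(set(tokens))
    # Natural log is an assumption; the source does not give the base.
    return {term: math.log(num_docs / df) for term, df in doc_freq.items()}


def tf_idf(docs):
    """Weight of each keyword in each document, assumed to be TF * IDF."""
    idf = inverse_document_frequency(docs)
    return [
        {term: tf * idf[term] for term, tf in term_frequency(tokens).items()}
        for tokens in docs
    ]


# Hypothetical tag lists, one list per document.
docs = [
    ["hakka", "food", "good", "hakka"],
    ["food", "good", "festival"],
    ["food", "good", "music", "good"],
]
weights = tf_idf(docs)
```

With these sample lists, "food" and "good" occur in every document, so their IDF is log(1) = 0 and their weight vanishes, while "hakka" occurs in only one document and receives a positive weight there.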

TF shows the relative importance of a keyword in a given document, while IDF
shows the importance of that keyword across the entire document collection. A
keyword is given a higher IDF value if it is used in only a small number of
documents, because it then has more discriminative power.

For example, for cultural events, if the word “Hakka” (a unique ethnic group
of “Han” Chinese) is treated as a keyword and appears in only a small number of
documents, its IDF value will be high. In contrast, words like “food” and
“good” appear in all documents and therefore have an IDF value close to zero.
For TF, the more frequently a word is used relative to the total number of
keywords in a document, the higher its TF value. If the word “Hakka” is used
frequently in a document, it has both a high IDF and a high TF, so it should be
considered a very significant keyword for recommendation.
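To make the “Hakka” example concrete, the arithmetic can be worked through with made-up numbers. All counts below are hypothetical and chosen only for illustration: a collection of 100 documents, “Hakka” appearing in 5 of them, “food” appearing in all 100, and one document in which “Hakka” accounts for 6 of 120 keyword occurrences. The natural logarithm and the product TF × IDF are assumptions, since the text fixes neither.

```latex
\begin{align*}
\mathrm{IDF}_{\text{Hakka}} &= \log\frac{100}{5} = \log 20 \approx 3.00 \\
\mathrm{IDF}_{\text{food}}  &= \log\frac{100}{100} = \log 1 = 0 \\
\mathrm{TF}_{\text{Hakka},j} &= \frac{6}{120} = 0.05 \\
\mathrm{TF}_{\text{Hakka},j}\times\mathrm{IDF}_{\text{Hakka}} &\approx 0.05 \times 3.00 = 0.15
\end{align*}
```

Under these assumed counts, “food” receives zero weight however often it occurs in the document, whereas “Hakka” keeps a positive weight.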

Utilizing tags from heterogeneous information sources in this way raises
research issues of tag classification and weighting. As described above, a tag
with a high frequency count is not necessarily more important; we therefore
classify the tags using the TF-IDF algorithm to provide accurate results for
decision reference.

3.2 The CTD Method

The CTD (Category Term Descriptor) method was proposed by Bong & Narayanan in
2004. It is derived from the classic term weighting scheme, TF-IDF. The method
explicitly chooses a feature set for each category by selecting only a set of
