explaining the statistical significance of the keywords [17]. Term Frequency (TF)
was proposed by Salton & McGill in 1983 for data indexing; combined with IDF, it
forms the TF-IDF weighting algorithm for keywords. The reason for using this
algorithm is that the keywords used vary from document to document [16];
by combining TF and IDF, it becomes possible to derive the relative weight of a
keyword across all documents.
TF-IDF is mainly used to find the relative weight of a keyword in a document.
TF is the frequency with which the keyword appears, and IDF is used to find the
relative importance of the keyword.
\[ \mathrm{IDF}_i = \log \frac{|D|}{|\{ d_j : t_i \in d_j \}|} \]

where $|D|$ is the total number of documents and $|\{ d_j : t_i \in d_j \}|$ is
the number of documents that contain the keyword $i$.

\[ \mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} \]

denotes the frequency count of the appearance of a keyword in a document divided
by the sum of all keywords' appearance frequencies in that document.
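The two definitions above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the representation of a document as a list of keywords, the toy data, and the use of the natural logarithm are all assumptions, since the text does not specify a logarithm base.

```python
import math
from collections import Counter


def term_frequency(keyword, document):
    """TF_{i,j}: count of `keyword` in `document` divided by the total keyword count."""
    counts = Counter(document)
    total = sum(counts.values())
    return counts[keyword] / total if total else 0.0


def inverse_document_frequency(keyword, documents):
    """IDF_i: log(|D| / number of documents containing `keyword`)."""
    containing = sum(1 for doc in documents if keyword in doc)
    if containing == 0:
        # The formula is undefined for a keyword that occurs nowhere.
        raise ValueError(f"{keyword!r} does not occur in any document")
    return math.log(len(documents) / containing)


# Hypothetical toy collection: each document is a list of keywords.
documents = [
    ["hakka", "festival", "food", "hakka"],
    ["food", "market", "good"],
    ["food", "concert", "good"],
]

print(term_frequency("hakka", documents[0]))            # 2 of 4 keywords
print(inverse_document_frequency("hakka", documents))   # log(3 / 1)
print(inverse_document_frequency("food", documents))    # log(3 / 3)
```

A keyword that occurs in every document yields an IDF of log(1) = 0 under this definition, which matches the intuition that such a keyword has no discriminative power.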
TF shows the relative importance of a keyword in a given document. IDF shows
the importance of the keyword in the entire document collection. A keyword is
given a higher IDF value if it is used in only a small number of documents,
because it then has more discriminative power.
For example, for cultural events, if the word "Hakka" (a unique ethnic group
of "Han" Chinese) is considered a keyword and it appears in a small number of
documents, its IDF value will be high. However, words like "food" and "good"
appear in all documents and therefore have an IDF value close to zero. For TF,
the more frequently a word is used relative to the total number of keywords in
a document, the higher its TF value. If the word "Hakka" is used frequently in
a document, then, since it has both a high IDF and a high TF, it should be
considered a very significant keyword for recommendation.
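As a worked illustration with hypothetical numbers (not taken from the study's data), suppose the collection holds 100 documents, "Hakka" appears in 5 of them and "food" in all 100, and one document uses "Hakka" 6 times among 120 keyword occurrences. Assuming a base-10 logarithm and the usual product form of the TF-IDF weight, neither of which the text states explicitly:

```latex
\begin{align*}
\mathrm{IDF}_{\text{Hakka}} &= \log_{10}\frac{100}{5} = \log_{10} 20 \approx 1.301 \\
\mathrm{IDF}_{\text{food}}  &= \log_{10}\frac{100}{100} = \log_{10} 1 = 0 \\
\mathrm{TF}_{\text{Hakka},j} &= \frac{6}{120} = 0.05 \\
\mathrm{TF}_{\text{Hakka},j} \times \mathrm{IDF}_{\text{Hakka}} &\approx 0.05 \times 1.301 \approx 0.065
\end{align*}
```

Under these assumptions "food" receives a weight of zero in every document regardless of how often it is used, while "Hakka" keeps a positive weight.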
This method of utilizing tags from heterogeneous information sources raises
research issues of tag classification and weighting. As described above, a tag
with a high frequency count is not necessarily more important; therefore, we
classify the tags using the TF-IDF algorithm to provide accurate results for
decision reference.
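The point that raw frequency and importance can diverge may be sketched as below. This is an illustrative sketch only: the function `rank_tags`, the tag lists, and the choice to combine TF and IDF as a product with a natural logarithm are assumptions, not the procedure used in the study.

```python
import math
from collections import Counter


def rank_tags(doc_index, documents):
    """Return the tags of one document sorted by descending TF-IDF weight."""
    doc = documents[doc_index]
    counts = Counter(doc)
    total = sum(counts.values())
    weights = {}
    for tag, n in counts.items():
        # Number of documents that contain the tag at least once.
        df = sum(1 for d in documents if tag in d)
        # Assumed weighting: TF multiplied by IDF.
        weights[tag] = (n / total) * math.log(len(documents) / df)
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)


# Hypothetical tag lists gathered from different sources.
documents = [
    ["food", "food", "food", "hakka", "hakka", "festival"],
    ["food", "good", "market"],
    ["food", "good", "concert"],
    ["food", "museum"],
]

ranking = rank_tags(0, documents)
for tag, weight in ranking:
    print(f"{tag}: {weight:.3f}")
```

In this toy data "food" is the most frequent tag in the first document but ranks last, because it occurs in every document and its IDF is zero.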
3.2 The CTD Method
The CTD (Category Term Descriptor) method was proposed by Bong & Narayanan in
2004. It is derived from the classic term weighting scheme, TF-IDF. The
method explicitly chooses a feature set for each category by only selecting set of
 