10.4 Category-Based Term Weights
10.4.1 Revisit of tfidf
As stated before, while many researchers believe that term weighting schemes of the tfidf form embody the three aforementioned assumptions, we understand tfidf in a much simpler manner:
a) Local weight - the tf term, normalized or not, specifies the weight of t_k within a specific document and is basically estimated from the frequency or relative frequency of t_k within that document.
b) Global weight - the idf term, normalized or not, defines the contribution of t_k to a specific document in a global sense.
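To make the two roles concrete, the following is a minimal sketch of a plain, unnormalized tfidf weight computed from raw counts; the function and variable names (tfidf_weight, doc_freq, the toy corpus) are illustrative only and are not taken from the chapter.

```python
import math
from collections import Counter

def tfidf_weight(term, doc_tokens, corpus):
    """Plain tfidf: local tf within one document times global idf over the corpus."""
    # Local weight: raw frequency of the term within this document.
    tf = Counter(doc_tokens)[term]
    # Global weight: inverse document frequency over the whole corpus.
    n_docs = len(corpus)
    doc_freq = sum(1 for d in corpus if term in d)
    idf = math.log(n_docs / doc_freq) if doc_freq else 0.0
    return tf * idf

corpus = [["price", "rises", "price"], ["match", "score"], ["price", "match"]]
print(tfidf_weight("price", corpus[0], corpus))  # 2 * log(3/2)
```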
If we temporarily ignore how tfidf is defined and focus on the core problem, i.e., whether a given document belongs to a given category, we realize that we need a set of terms to represent the documents effectively and a reference framework to make the comparison possible. Since previous research shows that tf is very important [22, 29, 38, 40] and that using tf alone can already achieve good performance, we retain the tf term. Now, let us consider idf, i.e., the global weighting of t_k.
The conjecture is that if term selection can effectively differentiate a set of terms T_k from all terms T to represent category c_i, then it is desirable to transform that difference into numeric values for further processing. Our approach is to replace the idf term with the value generated by term selection. Since this procedure is performed jointly with the category membership, the weights of T_k are, in effect, category specific. Therefore, the only remaining problem is which term selection method is appropriate for computing such values.
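Structurally, the resulting scheme keeps the local tf factor and swaps the global idf factor for a category-specific score produced by term selection. The sketch below assumes a generic term_selection_score(term, category) callable as a placeholder for whichever measure is chosen; it is not a function defined in the chapter.

```python
from collections import Counter

def category_based_weight(term, doc_tokens, category, term_selection_score):
    """Local tf times a category-specific score that takes the place of idf."""
    tf = Counter(doc_tokens)[term]               # local weight, as in plain tfidf
    ts = term_selection_score(term, category)    # global weight, now per category
    return tf * ts

# Example with a dummy scoring function standing in for real term selection.
dummy_score = lambda term, category: 1.0
print(category_based_weight("price", ["price", "rises", "price"], "finance", dummy_score))
```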
10.4.2 Category-Based Term Weights
We decide to compute the term values from the most direct information available, i.e., A, B and C, and to combine them in a sensible way that differs from existing feature selection measures. From Table 10.2, two important ratios stand out, namely A/B and A/C:
A/B: it is easy to understand that if term t_k is highly relevant to category c_i only, which basically says that t_k is a good feature to represent c_i, then the value of A/B tends to be higher.
A/C: given two terms t_k, t_l and a category c_i, the term with the higher value of A/C will be the better feature to represent c_i, since a larger portion of it occurs with category c_i.
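A small sketch of how these quantities could be obtained from a labeled corpus is given below. The definitions of A and B follow the probability estimates restated later in this section (documents of c_i containing t_k, and documents outside c_i containing t_k); treating C as the number of documents of c_i that do not contain t_k is an assumption about Table 10.2, which is not reproduced here.

```python
def contingency_counts(term, corpus, labels, category):
    """Count A, B, C for one (term, category) pair over a labeled corpus.

    A: documents of `category` that contain `term`
    B: documents outside `category` that contain `term`
    C: documents of `category` that do not contain `term` (assumed definition)
    """
    A = B = C = 0
    for doc_tokens, label in zip(corpus, labels):
        has_term = term in doc_tokens
        if label == category:
            A += has_term
            C += not has_term
        else:
            B += has_term
    return A, B, C

corpus = [["price", "rises"], ["market", "report"], ["price", "odds"], ["match", "score"]]
labels = ["finance", "finance", "sport", "sport"]
A, B, C = contingency_counts("price", corpus, labels, "finance")
eps = 1e-6                              # guard against division by zero
print(A / (B + eps), A / (C + eps))     # the two ratios A/B and A/C
```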
Obviously, the role of A/B is straightforward and, when A/B is equal, A/C can possibly differentiate the terms further; this is conjectured to be the contribution of A/C. In fact, these two ratios are nicely supported by probability estimates. For instance, A/B can be expanded as (A/N)/(B/N), where N is the total number of documents, A/N is the probability estimate of documents from category c_i in which term t_k occurs at least once, and B/N is the probability estimate of documents not from category c_i in which term t_k occurs at least once. In this manner, A/B can be interpreted as a relevance indicator of term t_k with respect to category c_i.
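Spelled out, this interpretation amounts to the identity below; the probability notation P(., .) is introduced here only for illustration and is not the chapter's own.

\[
\frac{A}{B} \;=\; \frac{A/N}{B/N} \;\approx\; \frac{P(t_k,\, c_i)}{P(t_k,\, \bar{c}_i)}
\]

Here P(t_k, c_i) is the estimated probability that a randomly chosen document belongs to c_i and contains t_k at least once, and P(t_k, \bar{c}_i) is the corresponding estimate for documents outside c_i; a large ratio therefore indicates that occurrences of t_k are concentrated in c_i.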