10.4 Category-Based Term Weights
10.4.1 Revisit of tfidf
As stated before, while many researchers believe that term weighting schemes of the tfidf form embody the three aforementioned assumptions, we understand tfidf in a much simpler manner:
a) Local weight - the tf term, normalized or not, specifies the weight of t_k within a specific document and is basically estimated from the frequency or relative frequency of t_k within that document.
b) Global weight - the idf term, normalized or not, defines the contribution of t_k to a specific document in a global sense.
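To make the two roles concrete, the following is a minimal sketch of a plain, unnormalized tfidf weight computed from raw counts; the function and variable names (tfidf_weight, doc_freq, the toy corpus) are illustrative only and are not taken from the chapter.

```python
import math
from collections import Counter

def tfidf_weight(term, doc_tokens, corpus):
    """Plain tfidf: local tf within one document times global idf over the corpus."""
    # Local weight: raw frequency of the term within this document.
    tf = Counter(doc_tokens)[term]
    # Global weight: inverse document frequency over the whole corpus.
    n_docs = len(corpus)
    doc_freq = sum(1 for d in corpus if term in d)
    idf = math.log(n_docs / doc_freq) if doc_freq else 0.0
    return tf * idf

corpus = [["price", "rises", "price"], ["match", "score"], ["price", "match"]]
print(tfidf_weight("price", corpus[0], corpus))  # 2 * log(3/2)
```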
If we temporarily ignore how tfidf is defined and focus on the core problem, i.e., whether a given document belongs to a given category, we realize that we need a set of terms to represent the documents effectively and a reference framework to make the comparison possible. Since previous research shows that tf is very important [22, 29, 38, 40] and that using tf alone can already achieve good performance, we retain the tf term. Now, let us consider idf, i.e., the global weighting of t_k.
The conjecture is that if term selection can effectively differentiate a set of terms T_k from all terms T to represent category c_i, then it is desirable to transform that difference into numeric values for further processing. Our approach is to replace the idf term with the value generated by term selection. Since this procedure is performed jointly with the category membership, the weights of T_k are, in effect, category specific. Therefore, the only remaining problem is which term selection method is appropriate for computing such values.
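Structurally, the resulting scheme keeps the local tf factor and swaps the global idf factor for a category-specific score produced by term selection. The sketch below assumes a generic term_selection_score(term, category) callable as a placeholder for whichever measure is chosen; it is not a function defined in the chapter.

```python
from collections import Counter

def category_based_weight(term, doc_tokens, category, term_selection_score):
    """Local tf times a category-specific score that takes the place of idf."""
    tf = Counter(doc_tokens)[term]               # local weight, as in plain tfidf
    ts = term_selection_score(term, category)    # global weight, now per category
    return tf * ts

# Example with a dummy scoring function standing in for real term selection.
dummy_score = lambda term, category: 1.0
print(category_based_weight("price", ["price", "rises", "price"], "finance", dummy_score))
```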
10.4.2 Category-Based Term Weights
We decide to compute the term values from the most direct information available, i.e., A, B and C, and to combine them in a sensible way that differs from existing feature selection measures. From Table 10.2, two important ratios stand out, namely A/B and A/C:
A/B: it is easy to understand that if term t_k is highly relevant to category c_i only, which basically says that t_k is a good feature to represent c_i, then the value of A/B tends to be higher.
A/C: given two terms t_k, t_l and a category c_i, the term with the higher value of A/C will be the better feature to represent c_i, since a larger portion of it occurs with category c_i.
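A small sketch of how these quantities could be obtained from a labeled corpus is given below. The definitions of A and B follow the probability estimates restated later in this section (documents of c_i containing t_k, and documents outside c_i containing t_k); treating C as the number of documents of c_i that do not contain t_k is an assumption about Table 10.2, which is not reproduced here.

```python
def contingency_counts(term, corpus, labels, category):
    """Count A, B, C for one (term, category) pair over a labeled corpus.

    A: documents of `category` that contain `term`
    B: documents outside `category` that contain `term`
    C: documents of `category` that do not contain `term` (assumed definition)
    """
    A = B = C = 0
    for doc_tokens, label in zip(corpus, labels):
        has_term = term in doc_tokens
        if label == category:
            A += has_term
            C += not has_term
        else:
            B += has_term
    return A, B, C

corpus = [["price", "rises"], ["market", "report"], ["price", "odds"], ["match", "score"]]
labels = ["finance", "finance", "sport", "sport"]
A, B, C = contingency_counts("price", corpus, labels, "finance")
eps = 1e-6                              # guard against division by zero
print(A / (B + eps), A / (C + eps))     # the two ratios A/B and A/C
```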
Obviously, the role of A/B is straightforward and, when A/B is equal, A/C can possibly differentiate the terms further; this is conjectured to be the contribution of A/C. In fact, these two ratios are nicely supported by probability estimates. For instance, A/B can be expanded as (A/N)/(B/N), where N is the total number of documents, A/N is the probability estimate of documents from category c_i in which term t_k occurs at least once, and B/N is the probability estimate of documents not from category c_i in which term t_k occurs at least once. In this manner, A/B can be interpreted as a relevance indicator of term t_k with respect to category c_i.
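Spelled out, this interpretation amounts to the identity below; the probability notation P(., .) is introduced here only for illustration and is not the chapter's own.

\[
\frac{A}{B} \;=\; \frac{A/N}{B/N} \;\approx\; \frac{P(t_k,\, c_i)}{P(t_k,\, \bar{c}_i)}
\]

Here P(t_k, c_i) is the estimated probability that a randomly chosen document belongs to c_i and contains t_k at least once, and P(t_k, \bar{c}_i) is the corresponding estimate for documents outside c_i; a large ratio therefore indicates that occurrences of t_k are concentrated in c_i.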