based on $f_{max}(t_k) = \max_{i=1}^{|c|} f(t_k, c_i)$, i.e., the maximum of the
category-specific values.
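As a minimal sketch of this globalization step, the snippet below takes the
category-specific scores f(t_k, c_i) of a single term and returns their maximum.
The function name, the dictionary layout and the example scores are hypothetical
and serve only as illustration.

```python
from typing import Dict


def global_max_score(scores_by_category: Dict[str, float]) -> float:
    """Return f_max(t_k): the largest category-specific score of one term."""
    if not scores_by_category:
        raise ValueError("at least one category score is required")
    return max(scores_by_category.values())


# Hypothetical scores f(t_k, c_i) of one term over three categories.
example_scores = {"sports": 0.12, "finance": 0.47, "politics": 0.30}
print(global_max_score(example_scores))
```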
Whether a method is used in the 'global' or the 'local' sense has little impact on
the choice of the method itself, but it does affect the performance of classifiers
built for different categories. In TC, the main purpose is to decide whether a given
document belongs to a specific category. We therefore prefer salient features that
distinguish one category from another, i.e., a 'local' approach. Ideally, the salient
feature set of one category shares no items with those of other categories.
Where such overlap cannot be avoided, the question becomes how best to represent the
shared features.
While many previous works have shown the relative strengths and merits of these
methods [14, 32, 37, 40, 47], our experience with feature selection over a number
of standard or ad hoc data sets shows that the performance of such methods can be
highly dependent on the data. This is partly due to the lack of a quantitative
understanding of different data sets, which needs further research. From our
previous study of all these feature selection methods and from what has been
reported in the literature [47], we note that, when these methods are applied to
term selection in text classification, they all rely on the four fundamental
information elements shown in Table 10.2: A denotes the number of documents
belonging to category $c_i$ in which the term $t_k$ occurs at least once; B denotes
the number of documents not belonging to category $c_i$ in which the term $t_k$
occurs at least once; C denotes the number of documents belonging to category $c_i$
in which the term $t_k$ does not occur; D denotes the number of documents not
belonging to category $c_i$ in which the term $t_k$ does not occur.
Table 10.2. Fundamental information elements used for feature selection in text
classification

               $c_i$    $\bar{c}_i$
$t_k$            A         B
$\bar{t}_k$      C         D
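The four elements can be counted in a single pass over a labeled corpus. The
sketch below assumes, purely for illustration, that each document is a pair of a
term set and a single category label; the function name and the toy corpus are
hypothetical and not taken from the text.

```python
from typing import Iterable, Set, Tuple


def contingency_counts(
    docs: Iterable[Tuple[Set[str], str]], term: str, category: str
) -> Tuple[int, int, int, int]:
    """Count (A, B, C, D) for one term and one category."""
    a = b = c = d = 0
    for terms, label in docs:
        has_term = term in terms        # term occurs at least once
        in_category = label == category
        if has_term and in_category:
            a += 1                      # A: in category, term present
        elif has_term:
            b += 1                      # B: not in category, term present
        elif in_category:
            c += 1                      # C: in category, term absent
        else:
            d += 1                      # D: not in category, term absent
    return a, b, c, d


# Hypothetical toy corpus: (set of terms, category label).
corpus = [
    ({"goal", "match"}, "sports"),
    ({"stock", "goal"}, "finance"),
    ({"match", "team"}, "sports"),
    ({"stock", "bond"}, "finance"),
    ({"team"}, "sports"),
]
print(contingency_counts(corpus, "goal", "sports"))
```

By construction every document falls into exactly one cell, so the four counts
always sum to the number of documents in the corpus.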
These four information elements have been used to estimate the probabilities
listed in Table 10.1. Table 10.3 shows the functions in Table 10.1 expressed in
terms of the information elements A, B, C and D.
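To make the substitution concrete, the following worked step shows how the Mutual
Information entry of Table 10.3 follows from frequency estimates of the
probabilities. It assumes that N = A + B + C + D is the total number of documents
and that mutual information takes its usual pointwise form; neither is spelled out
in this passage, so both should be read as assumptions.

```latex
\begin{align*}
P(t_k) &\approx \frac{A+B}{N}, \qquad
P(c_i) \approx \frac{A+C}{N}, \qquad
P(t_k, c_i) \approx \frac{A}{N}, \qquad N = A+B+C+D, \\
MI(t_k, c_i) &= \log \frac{P(t_k, c_i)}{P(t_k)\,P(c_i)}
  \approx \log \frac{A/N}{\frac{A+B}{N}\cdot\frac{A+C}{N}}
  = \log \frac{AN}{(A+B)(A+C)}.
\end{align*}
```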
Table 10.3. Feature selection methods and their formations as represented by
information elements in Table 10.2

Method                     Mathematical Form Represented by Information Elements
Information Gain           $-\frac{A+C}{N}\log\frac{A+C}{N} + \frac{A}{N}\log\frac{A}{A+B} + \frac{C}{N}\log\frac{C}{C+D}$
Mutual Information         $\log\frac{AN}{(A+B)(A+C)}$
Chi-square                 $\frac{N(AD-BC)^2}{(A+C)(B+D)(A+B)(C+D)}$
Correlation Coefficient    $\frac{\sqrt{N}\,(AD-BC)}{\sqrt{(A+C)(B+D)(A+B)(C+D)}}$
Odds Ratio                 $\log\frac{AD}{BC}$
Simplified Chi-square      $\frac{AD-BC}{N^2}$
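As an illustration of how these forms are evaluated, the sketch below computes
the Mutual Information, Chi-square, Odds Ratio and Simplified Chi-square entries
directly from A, B, C and D. It assumes N = A + B + C + D and natural logarithms
(the table does not state a base), and it simply rejects zero counts rather than
smoothing them; the function name and the example counts are hypothetical.

```python
import math
from typing import Dict


def feature_scores(a: int, b: int, c: int, d: int) -> Dict[str, float]:
    """Evaluate four Table 10.3 forms from the counts A, B, C, D."""
    if min(a, b, c, d) <= 0:
        # Simplification: zero cells would make a log or a ratio undefined.
        raise ValueError("all four counts must be positive in this sketch")
    n = a + b + c + d                       # assumed: total number of documents
    diff = a * d - b * c                    # AD - BC, shared by several forms
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return {
        "mutual_information": math.log(a * n / ((a + b) * (a + c))),
        "chi_square": n * diff ** 2 / denom,
        "odds_ratio": math.log(a * d / (b * c)),
        "simplified_chi_square": diff / n ** 2,
    }


# Hypothetical counts for one (term, category) pair.
for name, value in feature_scores(40, 10, 20, 30).items():
    print(f"{name}: {value:.4f}")
```

A term that is statistically independent of the category has AD = BC, so its
chi-square and simplified chi-square scores are both zero, and its mutual
information and odds ratio are log(1) = 0 as well.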
 