little in this situation. With increased memory capacity, we will be better
and better guided by theoretical bounds in determining sample size.
Sample size is also related to mining quality. However, samples of
the same size can vary in quality. In particular, some samples are more
representative of, or resemble, the original data more closely than
others. Hence, there is a need to measure sample quality; we then wish
to establish a positive correlation between sample quality and mining
quality.
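As a concrete illustration (not from this chapter), one simple way to quantify how well a sample resembles the original data is to compare the distribution of an attribute in the sample against its distribution in the full dataset, e.g., with a symmetric KL-style divergence. The function name and scoring choices below are assumptions made for this sketch:

import numpy as np

def sample_quality(population, sample, bins=20):
    """Illustrative quality score: how closely the sample's
    distribution of one attribute matches the population's.
    Returns a symmetric KL-style divergence (lower = better)."""
    lo, hi = population.min(), population.max()
    p, _ = np.histogram(population, bins=bins, range=(lo, hi))
    q, _ = np.histogram(sample, bins=bins, range=(lo, hi))
    eps = 1e-12                      # avoid log(0) on empty bins
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()  # normalize counts to probabilities
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

# Example: a larger random sample is usually more representative.
rng = np.random.default_rng(0)
data = rng.normal(size=100_000)
print(sample_quality(data, rng.choice(data, 100)))     # higher divergence
print(sample_quality(data, rng.choice(data, 10_000)))  # lower divergence

Under this kind of measure, a positive correlation between sample quality and mining quality can then be tested empirically.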
8.2.4. Feature selection based on information theory
This is a practical and efficient method that eliminates features which
give little information. It addresses both theoretical and empirical
aspects of feature selection, i.e., it is a filter approach that can handle
a larger number of features. It is a probabilistic approach, i.e., for
each instance it considers:
Pr(C | F = f),
(8.1)
where C is the class variable, F denotes the set of features, and f is a tuple of feature values.
This method uses cross-entropy (the KL distance) to select a feature subset G such that
Pr(C | G = f_G) remains as close as possible to Pr(C | F = f).
Now:
Δ_G = Σ_f Pr(f) δ_G(f)
(8.2)
and:
δ_G(f) = D(Pr(C | f), Pr(C | f_G))
(8.3)
i.e., it employs backward elimination (at each step, eliminate the feature F_i whose
removal causes the smallest increase in Δ_G).
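A minimal sketch of this backward-elimination idea, assuming discrete features and empirical estimates of the probabilities in Eqs. (8.2) and (8.3); all helper names here are illustrative, not from the chapter:

from collections import Counter, defaultdict
import math

def cond_class_dist(rows, labels, feat_idx):
    """Estimate Pr(C | projection onto features in feat_idx) from data."""
    counts = defaultdict(Counter)
    for row, c in zip(rows, labels):
        counts[tuple(row[i] for i in feat_idx)][c] += 1
    return {k: {c: n / sum(v.values()) for c, n in v.items()}
            for k, v in counts.items()}

def delta(rows, labels, keep):
    """Eqs. (8.2)-(8.3): expected KL distance between Pr(C | f)
    and Pr(C | f_G), weighted by the empirical Pr(f)."""
    full = cond_class_dist(rows, labels, range(len(rows[0])))
    proj = cond_class_dist(rows, labels, keep)
    total = 0.0
    for row in rows:
        p = full[tuple(row)]
        q = proj[tuple(row[i] for i in keep)]
        total += sum(pc * math.log(pc / q[c]) for c, pc in p.items())
    return total / len(rows)   # averaging over rows supplies Pr(f)

def backward_eliminate(rows, labels, n_keep):
    keep = list(range(len(rows[0])))
    while len(keep) > n_keep:
        # drop the feature whose removal increases Delta the least
        best = min(keep, key=lambda i: delta(rows, labels,
                                             [j for j in keep if j != i]))
        keep.remove(best)
    return keep

# Toy usage: feature 0 determines the class; feature 1 is noise.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 1, 1]
print(backward_eliminate(rows, labels, n_keep=1))  # -> [0]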
Working principle:
If Pr(A = a | X = x, B = b) = Pr(A = a | X = x), then B gives us no
information about A beyond what X already provides.
M is a Markov blanket for a feature F_i if M does not contain F_i and,
given M, F_i is conditionally independent of the remaining features and
the class.
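To make the working principle concrete, the condition Pr(A = a | X = x, B = b) = Pr(A = a | X = x) can be checked empirically from data; the function name and tolerance below are assumptions of this sketch, not part of the method as stated:

from collections import Counter

def gives_no_information(rows, a_idx, x_idx, b_idx, tol=0.01):
    """Empirically test whether Pr(A | X, B) ~= Pr(A | X),
    i.e., whether B adds no information about A once X is known."""
    joint = Counter((r[a_idx], r[x_idx], r[b_idx]) for r in rows)
    xb = Counter((r[x_idx], r[b_idx]) for r in rows)
    ax = Counter((r[a_idx], r[x_idx]) for r in rows)
    x = Counter(r[x_idx] for r in rows)
    for (a, xv, b), n in joint.items():
        p_given_xb = n / xb[(xv, b)]        # Pr(A=a | X=xv, B=b)
        p_given_x = ax[(a, xv)] / x[xv]     # Pr(A=a | X=xv)
        if abs(p_given_xb - p_given_x) > tol:
            return False
    return True

# Toy usage: B duplicates X, so it adds nothing once X is known.
rows = [(a, x, x) for a in (0, 1) for x in (0, 1) for _ in range(5)]
print(gives_no_information(rows, a_idx=0, x_idx=1, b_idx=2))  # True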
With these two measures, two new feature selection algorithms were
developed: the quadratic MI-based feature selection (QMIFS) approach and
the MI-based constructive criterion (MICC) approach. In classificatory analysis,