$$s \times t = n, \qquad s = t$$
Thus, this discretizer is an equal frequency (or equal width) discretizer in which
the interval frequency and the number of intervals take the same value, and both
depend only on the number of instances in the training data.
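As a rough illustration of this relation, the sketch below splits a sorted attribute into about √n intervals of about √n instances each, where n is the number of training instances. The function name `proportional_intervals`, the use of the floor of the square root, and the way the remainder is spread over the first intervals are assumptions made for this example. Tied values are not treated specially.

```python
import math

def proportional_intervals(values):
    """Split values into about sqrt(n) intervals of about sqrt(n) instances each."""
    ordered = sorted(values)
    n = len(ordered)
    t = max(1, math.isqrt(n))    # number of intervals (assumed: floor of sqrt(n))
    base, extra = divmod(n, t)   # spread any remainder over the first intervals
    intervals, start = [], 0
    for i in range(t):
        size = base + (1 if i < extra else 0)
        intervals.append(ordered[start:start + size])
        start += size
    return intervals

groups = proportional_intervals(range(100))
sizes = [len(g) for g in groups]
print(len(groups), sizes)
```

With 100 instances, both the number of intervals and the interval frequency come out as 10. Both grow together as more training data becomes available.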
FFD [122]
FFD stands for fixed frequency discretization and was proposed for managing bias
and variance, especially in Naïve Bayes-based classifiers. To discretize a continuous
attribute, FFD sets a sufficient interval frequency, f. It then discretizes the values,
sorted in ascending order, into intervals of frequency f. Thus each interval has
approximately the same number f of training instances with adjacent values. By
incorporating f, FFD aims to ensure that the interval frequency is sufficient, so that
there are enough training instances in each interval to reliably estimate the Naïve
Bayes probabilities.
Equal frequency discretization and FFD are easily confused. The former fixes the
number of intervals: it chooses that number arbitrarily and then discretizes a
continuous attribute into intervals such that each interval has the same number of
training instances. The latter, FFD, fixes the interval frequency at the value f. It
then sets cut points so that each interval contains f training instances, which
controls the discretization variance.
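The contrast between the two methods can be made concrete with a minimal sketch. The function names, the use of midpoints between adjacent sorted values as cut points, and the decision to merge any remainder into the last FFD interval are assumptions made for illustration, not details taken from the source. Tied values are not treated specially, and k is assumed not to exceed the number of values.

```python
def equal_frequency_cuts(values, k):
    """Equal frequency discretization: the number of intervals k is fixed."""
    ordered = sorted(values)
    n = len(ordered)
    # Index of the first value of each interval after the first one.
    starts = [(i * n) // k for i in range(1, k)]
    return [(ordered[i - 1] + ordered[i]) / 2 for i in starts]

def fixed_frequency_cuts(values, f):
    """FFD-style discretization: the interval frequency f is fixed."""
    ordered = sorted(values)
    n = len(ordered)
    # n // f intervals; leftover instances join the last interval (assumption).
    starts = [f * i for i in range(1, n // f)]
    return [(ordered[i - 1] + ordered[i]) / 2 for i in starts]

small = list(range(1, 13))   # 12 training values
large = list(range(1, 25))   # 24 training values

print(equal_frequency_cuts(small, 3), equal_frequency_cuts(large, 3))
print(fixed_frequency_cuts(small, 4), fixed_frequency_cuts(large, 4))
```

When the training set doubles, the equal frequency version keeps two cut points and its intervals grow from 4 to 8 instances. The fixed frequency version keeps 4 instances per interval and the number of cut points grows from 2 to 5.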
CAIM [70]
CAIM stands for the Class-Attribute Interdependency Maximization criterion, which
measures the dependency between the class variable C and the discretized variable
D for attribute A. The method requires the computation of the quanta matrix [24],
which, in summary, records the number of real values of A that fall within each
interval, broken down by the class of the corresponding example. The criterion is
calculated as:
$$\mathrm{CAIM}(C, D, A) = \frac{\sum_{r=1}^{m} \frac{\max_r^2}{M_{+r}}}{m}$$
where $m$ is the number of intervals, $r$ iterates through all intervals, i.e.
$r = 1, 2, \ldots, m$, $\max_r$ is the maximum value among all $q_{ir}$ values (the
maximum value within the $r$th column of the quanta matrix), and $M_{+r}$ is the
total number of continuous values of attribute $A$ that are within the interval
$(d_{r-1}, d_r]$.
According to the authors, CAIM has the following properties:
The larger the value of CAIM, the higher the interdependence between the class
labels and the discrete intervals.
It generates discretization schemes where each interval has all of its values grouped
within a single class label.
It takes into account the negative impact that values belonging to classes other
than the class with the maximum number of values within an interval have on the
discretization scheme.
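To make the criterion and the first two properties concrete, the following sketch evaluates CAIM for a small quanta matrix stored as a list of rows, one row per class and one column per interval. It uses the squared column maximum, as in the published CAIM criterion. The function name `caim`, the matrix layout, and the example counts are assumptions made for illustration. The sketch only scores a given discretization and does not search for cut points.

```python
def caim(quanta):
    """CAIM value of a quanta matrix (rows = classes, columns = intervals)."""
    m = len(quanta[0])                        # number of intervals
    total = 0.0
    for r in range(m):
        column = [row[r] for row in quanta]   # q_ir for every class i in interval r
        m_plus_r = sum(column)                # M_{+r}: values falling in interval r
        if m_plus_r > 0:                      # skip empty intervals (assumption)
            total += max(column) ** 2 / m_plus_r
    return total / m

# Hypothetical counts: two classes, two intervals, 20 values in total.
pure = [[10, 0],
        [0, 10]]    # each interval holds a single class
mixed = [[5, 5],
         [5, 5]]    # each interval is split evenly between the classes

print(caim(pure), caim(mixed))
```

The scheme whose intervals each contain a single class scores 10.0, while the evenly mixed scheme scores 2.5, in line with a larger CAIM value indicating higher interdependence between class labels and intervals.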
 