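Because the score computed for subscribing is greater than the score computed for
not subscribing, the client shown in Table 7.4 is assigned the label that
corresponds to subscribing. That is, the client is classified as likely to
subscribe to the term deposit.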
Although the scores are small in magnitude, it is the ratio of the two class scores
that matters. In fact, the two scores are not the true probabilities but are only
proportional to the true probabilities, as shown in Equation 7.14. After all, if the
scores were indeed the true probabilities, their sum would be equal to one. For
problems with a large number of attributes, or attributes with a high number of
levels, these values can become very small in magnitude (close to zero), resulting
in even smaller differences between the scores. This is the problem of numerical
underflow, caused by multiplying several probability values that are close to zero.
A way to alleviate the problem is to compute the logarithm of the product, which is
equivalent to the sum of the logarithms of the probabilities. Thus, the naïve Bayes
formula can be rewritten as shown in Equation 7.15.
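$$\log\Big(P(c_i)\prod_{j=1}^{m}P(a_j\mid c_i)\Big)=\log P(c_i)+\sum_{j=1}^{m}\log P(a_j\mid c_i) \qquad (7.15)$$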
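The underflow problem and the logarithmic rewrite can be sketched in Python. This is
a minimal illustration under stated assumptions: the priors (0.5), the count of 400
attributes, and the conditional probabilities (0.1 and 0.09) are hypothetical values
chosen only to force underflow. They are not taken from Table 7.4, and the function
names are illustrative.

```python
import math

# Hypothetical inputs (assumptions, not values from the text):
# two classes with equal priors and 400 attributes each.
PRIOR = 0.5
cond_probs_a = [0.1] * 400    # conditional probabilities for class A
cond_probs_b = [0.09] * 400   # conditional probabilities for class B


def product_score(prior, cond_probs):
    """Score as a plain product of the prior and the conditional probabilities."""
    score = prior
    for p in cond_probs:
        score *= p
    return score


def log_score(prior, cond_probs):
    """Same score in log space: log of the prior plus the sum of the logs."""
    return math.log(prior) + sum(math.log(p) for p in cond_probs)


prod_a = product_score(PRIOR, cond_probs_a)
prod_b = product_score(PRIOR, cond_probs_b)
log_a = log_score(PRIOR, cond_probs_a)
log_b = log_score(PRIOR, cond_probs_b)

# The products underflow in double precision, so the classes cannot be compared.
print(prod_a, prod_b)
# The log scores stay finite and still rank class A above class B.
print(log_a, log_b, log_a > log_b)
```

Because the logarithm is monotonically increasing, picking the class with the
largest log score gives the same label as picking the class with the largest
product, whenever the product can be represented at all.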
Although the risk of underflow grows with the number of attributes, logarithms are
usually applied regardless of how many attributes there are.
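As noted earlier in this section, the class scores are only proportional to the
true probabilities. The sketch below shows one way to recover values that sum to
one from log scores, by rescaling before exponentiating. The two scores used here
(2e-4 and 1e-4) and the labels "yes" and "no" are hypothetical; they are not the
scores computed for the client in Table 7.4, and the helper name is illustrative.

```python
import math


def normalize_log_scores(log_scores):
    """Turn per-class log scores into probabilities that sum to one.

    Subtracting the largest log score first keeps the exponentials
    from underflowing when every score is very negative.
    """
    largest = max(log_scores.values())
    shifted = {label: math.exp(s - largest) for label, s in log_scores.items()}
    total = sum(shifted.values())
    return {label: value / total for label, value in shifted.items()}


# Hypothetical log scores for two classes (assumed values for illustration).
log_scores = {"yes": math.log(2e-4), "no": math.log(1e-4)}
posteriors = normalize_log_scores(log_scores)

for label, prob in posteriors.items():
    print(label, round(prob, 4))
```

Only the ratio of the scores affects the result: multiplying both hypothetical
scores by the same constant leaves the normalized values unchanged.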
7.2.3 Smoothing
If one of the attribute values does not appear with one of the class labels within
the training set, the corresponding conditional probability P(a_j | c_i) will equal
zero. When this happens, the resulting score P(c_i | A), obtained by multiplying all
the P(a_j | c_i), immediately becomes zero regardless of how large some of the other
conditional probabilities are. Therefore overfitting occurs. Smoothing techniques
can be employed to adjust the probabilities P(a_j | c_i) and to ensure a nonzero
value of P(c_i | A). A smoothing
 