Advanced Analytical Theory and Methods: Classification - Data Science and Big Data Analytics


Because the score computed for the subscribing class exceeds the score for the non-subscribing class, the client shown in Table 7.4 is assigned the subscribing label. That is, the client is classified as likely to subscribe to the term deposit.

Although the scores are small in magnitude, it is the ratio of the two class scores that matters. In fact, the scores are not the true probabilities but are only proportional to the true probabilities, as shown in Equation 7.14. After all, if the scores were indeed the true probabilities, their sum would be equal to one. In problems with a large number of attributes, or attributes with a high number of levels, these values can become very small in magnitude (close to zero), so the differences between the scores become smaller still. This is the problem of numerical underflow, caused by multiplying several probability values that are close to zero. A way to alleviate the problem is to compute the logarithm of the product, which is equivalent to the sum of the logarithms of the probabilities. Thus, the naïve Bayes formula can be rewritten as shown in Equation 7.15.
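
In one common notation, assumed here for illustration rather than taken from the text above, let c_i denote a class label and a_1, ..., a_m the attribute values of the record being classified. The log form then follows from the identity that the logarithm of a product is the sum of the logarithms:

```latex
\log\Bigl( P(c_i) \prod_{j=1}^{m} P(a_j \mid c_i) \Bigr)
  = \log P(c_i) + \sum_{j=1}^{m} \log P(a_j \mid c_i)
```

Because the logarithm is monotonically increasing, the class with the largest log score is also the class with the largest product score, so the assigned label does not change.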

7.15

Although the risk of underflow grows with the number of attributes, logarithms are usually used regardless of the number of attribute dimensions.
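
The underflow problem and the logarithm remedy can be illustrated with a minimal Python sketch. The function names, the prior of 0.5, and the 400 conditional probabilities per class are hypothetical values chosen for illustration; they are not taken from Table 7.4.

```python
import math

def product_score(prior, likelihoods):
    """Direct form: the prior multiplied by every conditional probability."""
    score = prior
    for p in likelihoods:
        score *= p
    return score

def log_score(prior, likelihoods):
    """Log form: log of the prior plus the sum of the log conditional probabilities."""
    return math.log(prior) + sum(math.log(p) for p in likelihoods)

def normalize(log_scores):
    """Turn log scores into values that sum to one, shifting by the maximum first
    so that the exponentials stay within floating-point range."""
    top = max(log_scores)
    weights = [math.exp(s - top) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical data: 400 attributes, each with a small conditional probability.
likelihoods_a = [0.01] * 400
likelihoods_b = [0.02] * 400

# Both products underflow to 0.0, so the two classes cannot be told apart.
product_a = product_score(0.5, likelihoods_a)
product_b = product_score(0.5, likelihoods_b)

# The log scores stay finite and can still be compared.
log_a = log_score(0.5, likelihoods_a)
log_b = log_score(0.5, likelihoods_b)

shares = normalize([log_a, log_b])
```

In this sketch both direct products collapse to zero, while the log scores remain finite and still rank the second class above the first. The `normalize` helper shows how scores that are only proportional to probabilities can be rescaled to sum to one.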

7.2.3 Smoothing

If one of the attribute values does not appear with one of the class labels within the training set, the corresponding conditional probability will equal zero. When this happens, the class score obtained by multiplying all the conditional probabilities immediately becomes zero, regardless of how large some of the other conditional probabilities are. Overfitting therefore occurs: a combination that merely happened to be absent from the training set is treated as impossible. Smoothing techniques can be employed to adjust the conditional probabilities and to ensure a nonzero class score. A smoothing
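
As a minimal sketch of the idea, the following Python example assumes add-one (Laplace) smoothing; this choice is an assumption, since no particular technique is named above. The function name `conditional_probs`, the `alpha` parameter, and the attribute levels are hypothetical.

```python
from collections import Counter

def conditional_probs(values, levels, alpha=0.0):
    """Estimate P(level | class) from the attribute values observed for one class.

    alpha = 0 gives raw relative frequencies; alpha = 1 is add-one (Laplace)
    smoothing, which adds a pseudo-count to every level.
    """
    counts = Counter(values)
    total = len(values) + alpha * len(levels)
    return {level: (counts[level] + alpha) / total for level in levels}

# Hypothetical training values of one attribute for one class label.
# The level "divorced" never occurs with this class.
levels = ["single", "married", "divorced"]
observed = ["single", "married", "married", "single", "married"]

raw = conditional_probs(observed, levels)                 # "divorced" gets probability 0
smoothed = conditional_probs(observed, levels, alpha=1.0)  # every level gets a nonzero probability
```

With raw frequencies the unseen level receives probability zero and would zero out the whole product. With the pseudo-count, every level keeps a small nonzero probability and the estimates still sum to one.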
