The naïve Bayes model
Naïve Bayes is a probabilistic model that makes predictions by computing the probability that a data point belongs to a given class. A naïve Bayes model assumes that each feature makes an independent contribution to the probability assigned to a class (that is, it assumes conditional independence between the features).
Due to this assumption, the probability of each class given the features becomes proportional to the product of the probabilities of each feature occurring given that class, multiplied by the prior probability of the class. This makes training the model tractable and relatively straightforward. The class prior probabilities and the feature conditional probabilities are all estimated from the frequencies present in the dataset. Classification is performed by selecting the most probable class, given the features and these estimated probabilities.
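In symbols, this is the standard naïve Bayes decision rule (the notation here is ours; see the Wikipedia reference below for the full derivation). For a feature vector \mathbf{x} = (x_1, \ldots, x_n), the predicted class is

\hat{c} = \arg\max_{c} \; p(c) \prod_{i=1}^{n} p(x_i \mid c)

which follows from Bayes' theorem, p(c \mid \mathbf{x}) \propto p(c)\, p(\mathbf{x} \mid c), combined with the conditional independence assumption p(\mathbf{x} \mid c) = \prod_{i} p(x_i \mid c).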
An assumption is also made about the feature distributions (the parameters of which are estimated from the data). MLlib implements multinomial naïve Bayes, which assumes that the feature distribution is a multinomial distribution representing non-negative frequency counts of the features.
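Concretely, under the multinomial assumption, the class-conditional likelihood of a count vector \mathbf{x} takes the standard form (the symbols here are ours):

p(\mathbf{x} \mid c) \propto \prod_{i=1}^{n} \theta_{c,i}^{\,x_i}

where \theta_{c,i} is the probability of feature i under class c, estimated as the relative frequency of feature i among the training examples of class c, typically with additive (Laplace) smoothing so that features unseen in a class do not produce zero probabilities.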
It is suitable for binary features (for example, 1-of-k encoded categorical features) and is commonly used for text and document classification (where, as we have seen in Chapter 3, Obtaining, Processing, and Preparing Data with Spark, the bag-of-words vector is a typical feature representation).
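As a minimal sketch of how this looks in practice, the following trains a multinomial naïve Bayes model with MLlib's NaiveBayes.train method; the toy count vectors and variable names are our own, and sc is assumed to be an existing SparkContext (as in the Spark shell):

import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical bag-of-words data: non-negative term counts per document,
// with class labels 0.0 and 1.0.
val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(2.0, 0.0, 1.0)),
  LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 0.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 3.0, 1.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 1.0, 2.0))
))

// lambda is the additive smoothing parameter applied to the frequency
// counts when estimating the conditional probabilities.
val nbModel = NaiveBayes.train(data, lambda = 1.0)

// Predict the most probable class for a new count vector.
val prediction = nbModel.predict(Vectors.dense(1.0, 0.0, 1.0))

The lambda parameter controls the smoothing discussed above; a value of 1.0 corresponds to classic Laplace smoothing.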
Note
Take a look at the MLlib - Naive Bayes section in the Spark documentation at http://spark.apache.org/docs/latest/mllib-naive-bayes.html for more information. The Wikipedia page at http://en.wikipedia.org/wiki/Naive_Bayes_classifier has a more detailed explanation of the mathematical formulation.
Here, we have shown the decision function of naïve Bayes on our simple binary classification example:
(Figure: the naïve Bayes decision function on the binary classification example.)