The naïve Bayes model
Naïve Bayes is a probabilistic model that makes predictions by computing the probability that a data point belongs to a given class. A naïve Bayes model assumes that each feature makes an independent contribution to the probability assigned to a class (that is, it assumes conditional independence between the features).
Due to this assumption, the probability of each class given the features becomes proportional to the product of the probabilities of each feature occurring given that class, multiplied by the prior probability of the class. This makes training the model tractable and relatively straightforward. The class prior probabilities and the feature conditional probabilities are all estimated from the frequencies present in the dataset. Classification is performed by selecting the most probable class, given the features and these estimated probabilities.
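In symbols, this is the standard naïve Bayes decision rule (the notation here is ours; see the Wikipedia reference below for the full derivation). For a feature vector \mathbf{x} = (x_1, \ldots, x_n), the predicted class is

\hat{c} = \arg\max_{c} \; p(c) \prod_{i=1}^{n} p(x_i \mid c)

which follows from Bayes' theorem, p(c \mid \mathbf{x}) \propto p(c)\, p(\mathbf{x} \mid c), combined with the conditional independence assumption p(\mathbf{x} \mid c) = \prod_{i} p(x_i \mid c).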
An assumption is also made about the feature distributions (the parameters of which are estimated from the data). MLlib implements multinomial naïve Bayes, which assumes that the feature distribution is a multinomial distribution representing non-negative frequency counts of the features.
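Concretely, under the multinomial assumption, the class-conditional likelihood of a count vector \mathbf{x} takes the standard form (the symbols here are ours):

p(\mathbf{x} \mid c) \propto \prod_{i=1}^{n} \theta_{c,i}^{\,x_i}

where \theta_{c,i} is the probability of feature i under class c, estimated as the relative frequency of feature i among the training examples of class c, typically with additive (Laplace) smoothing so that features unseen in a class do not produce zero probabilities.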
It is suitable for binary features (for example, 1-of-k encoded categorical features) and is commonly used for text and document classification (where, as we have seen in Chapter 3, Obtaining, Processing, and Preparing Data with Spark, the bag-of-words vector is a typical feature representation).
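As a minimal sketch of how this looks in practice, the following trains a multinomial naïve Bayes model with MLlib's NaiveBayes.train method; the toy count vectors and variable names are our own, and sc is assumed to be an existing SparkContext (as in the Spark shell):

import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical bag-of-words data: non-negative term counts per document,
// with class labels 0.0 and 1.0.
val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(2.0, 0.0, 1.0)),
  LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 0.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 3.0, 1.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 1.0, 2.0))
))

// lambda is the additive smoothing parameter applied to the frequency
// counts when estimating the conditional probabilities.
val nbModel = NaiveBayes.train(data, lambda = 1.0)

// Predict the most probable class for a new count vector.
val prediction = nbModel.predict(Vectors.dense(1.0, 0.0, 1.0))

The lambda parameter controls the smoothing discussed above; a value of 1.0 corresponds to classic Laplace smoothing.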
Note
Take a look at the MLlib - Naive Bayes section in the Spark documentation at http://spark.apache.org/docs/latest/mllib-naive-bayes.html for more information. The Wikipedia page at http://en.wikipedia.org/wiki/Naive_Bayes_classifier has a more detailed explanation of the mathematical formulation.
Here, we have shown the decision function of naïve Bayes on our simple binary classification example:
(Figure: the naïve Bayes decision function on the binary classification example.)