Types of classification models
We will explore three common classification models available in Spark: linear models,
decision trees, and naïve Bayes models. Linear models, while less complex, are relatively
easy to scale to very large datasets. Decision trees are a powerful nonlinear technique that
can be a little more difficult to scale up (fortunately, MLlib takes care of this for us!) and
more computationally intensive to train, but they deliver leading performance in many
situations. Naïve Bayes models are simpler still, and they are easy to train efficiently and
to parallelize (in fact, they require only one pass over the dataset). They can also give
reasonable performance in many cases when appropriate feature engineering is used. A naïve
Bayes model also provides a good baseline against which we can measure the performance of
other models.
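To make the "one pass over the dataset" claim concrete, here is a minimal pure-Python sketch of naïve Bayes training. It is not MLlib's implementation: the function names train_naive_bayes and predict, the toy data, and the choice of a multinomial model with additive (Laplace) smoothing are all assumptions made for illustration. The point is that training only accumulates per-class counts, which is why it needs a single pass and parallelizes easily (partial counts from different partitions can simply be added together).

```python
import math
from collections import defaultdict


def train_naive_bayes(rows, num_features, smoothing=1.0):
    """Fit a multinomial naive Bayes model in a single pass over `rows`.

    rows: iterable of (label, features) pairs, where features is a list of
    non-negative counts of length num_features (hypothetical input format).
    """
    class_counts = defaultdict(int)
    feature_sums = defaultdict(lambda: [0.0] * num_features)
    total = 0

    # The only pass over the data: accumulate simple sufficient statistics.
    for label, features in rows:
        class_counts[label] += 1
        total += 1
        sums = feature_sums[label]
        for j, value in enumerate(features):
            sums[j] += value

    # Turn the counts into smoothed log-probabilities (no data access needed).
    model = {}
    for label, count in class_counts.items():
        sums = feature_sums[label]
        denom = sum(sums) + smoothing * num_features
        log_prior = math.log(count / total)
        log_likelihood = [math.log((s + smoothing) / denom) for s in sums]
        model[label] = (log_prior, log_likelihood)
    return model


def predict(model, features):
    """Return the label with the highest log-posterior score."""
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(x * w for x, w in zip(features, log_likelihood))
    return max(model, key=score)


# Toy binary dataset (made up for illustration): label, feature counts.
data = [
    (1, [3, 0, 1]),
    (1, [2, 1, 0]),
    (0, [0, 3, 1]),
    (0, [1, 2, 2]),
]

# Passing an iterator shows that training consumes the data exactly once.
model = train_naive_bayes(iter(data), num_features=3)
prediction_a = predict(model, [4, 0, 0])
prediction_b = predict(model, [0, 4, 0])
```

In this sketch, an input dominated by the first feature is assigned to class 1 and one dominated by the second feature to class 0, matching how those features are distributed in the toy data.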
Currently, Spark's MLlib supports binary classification for linear models, decision
trees, and naïve Bayes models, and multiclass classification for decision trees and naïve
Bayes models. In this topic, to keep the examples simple, we will focus on the
binary case.