Building a Classification Model with Spark - Machine Learning with Spark

Types of classification models

We will explore three common classification models available in Spark: linear models, decision trees, and naïve Bayes models. Linear models, while less complex, are relatively easy to scale to very large datasets. Decision trees are a powerful nonlinear technique that can be harder to scale up (fortunately, MLlib takes care of this for us) and more computationally intensive to train, but they deliver leading performance in many situations. Naïve Bayes models are simpler, but they are easy to train efficiently and to parallelize (in fact, they require only one pass over the dataset). They can also give reasonable performance in many cases when appropriate feature engineering is used. A naïve Bayes model also provides a good baseline against which we can measure the performance of other models.
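
To make the single-pass claim concrete, here is a minimal sketch in plain Python of how a multinomial naïve Bayes model could be fitted by accumulating counts once. It is not MLlib's implementation; the function names (train_naive_bayes, predict), the Laplace smoothing default, and the toy data are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def train_naive_bayes(examples, num_features, smoothing=1.0):
    """Fit a multinomial naive Bayes model in a single pass (illustrative only).

    examples: iterable of (label, feature_counts) pairs, where feature_counts
    is a sequence of non-negative numbers of length num_features.
    """
    label_counts = defaultdict(int)                            # examples seen per class
    feature_sums = defaultdict(lambda: [0.0] * num_features)   # per-class feature totals

    # The only pass over the data: accumulate the sufficient statistics.
    for label, features in examples:
        label_counts[label] += 1
        sums = feature_sums[label]
        for j, value in enumerate(features):
            sums[j] += value

    # Turn the accumulated counts into log priors and smoothed log likelihoods.
    total = sum(label_counts.values())
    model = {}
    for label, count in label_counts.items():
        sums = feature_sums[label]
        denom = sum(sums) + smoothing * num_features
        log_prior = math.log(count / total)
        log_likelihoods = [math.log((s + smoothing) / denom) for s in sums]
        model[label] = (log_prior, log_likelihoods)
    return model


def predict(model, features):
    """Return the label with the highest log posterior score."""
    def score(label):
        log_prior, log_likelihoods = model[label]
        return log_prior + sum(x * w for x, w in zip(features, log_likelihoods))
    return max(model, key=score)


# Toy data, made up for illustration: label 1 favours feature 0, label 0 favours feature 1.
data = [
    (1, [3, 0]),
    (1, [2, 1]),
    (0, [0, 3]),
    (0, [1, 2]),
]

# Passing an iterator shows that the data is consumed exactly once.
model = train_naive_bayes(iter(data), num_features=2)
prediction_a = predict(model, [4, 0])
prediction_b = predict(model, [0, 4])
```

Because the per-class counts are simple sums, partial counts computed on separate partitions of a dataset can be added together, which is why this kind of training parallelizes well.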

Currently, Spark's MLlib library supports binary classification for linear models, decision trees, and naïve Bayes models, and multiclass classification for decision trees and naïve Bayes models. For simplicity in illustrating the examples, we will focus on the binary case in this topic.
