The LogisticRegressionModel from these algorithms computes a score between 0 and 1 for each point, as returned by the logistic function. It then returns either 0 or 1 based on a threshold that can be set by the user: by default, if the score is at least 0.5, it will return 1. You can change this threshold via setThreshold(). You can also disable it altogether via clearThreshold(), in which case predict() will return the raw scores. For balanced datasets with about the same number of positive and negative examples, we recommend leaving the threshold at 0.5. For imbalanced datasets, you can increase the threshold to drive down the number of false positives (i.e., increase precision but decrease recall), or you can decrease the threshold to drive down the number of false negatives.
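The threshold behavior can be sketched in plain Python (a minimal illustration of the logic described above, not MLlib's actual implementation; the function names here are hypothetical):

```python
import math

def logistic(z):
    """Logistic function: maps a raw linear score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(raw_score, threshold=0.5):
    """Mimic the described behavior: return a 0/1 label based on the
    threshold, or the raw logistic score when the threshold is cleared
    (represented here by threshold=None)."""
    score = logistic(raw_score)
    if threshold is None:  # analogous to clearThreshold()
        return score
    return 1 if score >= threshold else 0
```

For example, a raw score of 2.0 gives a logistic score of about 0.88, so `predict(2.0)` returns 1 at the default threshold but 0 if the threshold is raised to 0.9.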
When using logistic regression, it is usually important to scale the
features in advance to be in the same range. You can use MLlib's
StandardScaler to do this, as seen in “Scaling” on page 222 .
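To make the scaling concrete, here is a plain-Python sketch of the kind of normalization StandardScaler can perform when configured to center and scale features (zero mean, unit standard deviation per column); this is an illustration of the idea, not MLlib code:

```python
def standardize(rows):
    """Scale each feature column to zero mean and unit standard
    deviation. rows is a list of equal-length feature lists."""
    n = len(rows)
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = []
    for j in range(d):
        var = sum((r[j] - means[j]) ** 2 for r in rows) / n
        stds.append(var ** 0.5 or 1.0)  # guard against zero variance
    return [[(r[j] - means[j]) / stds[j] for j in range(d)]
            for r in rows]
```

After scaling, features measured on very different ranges (say, word counts versus document lengths) contribute comparably to the learned weights.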
Support Vector Machines
Support Vector Machines, or SVMs, are another binary classification method with
linear separating planes, again expecting labels of 0 or 1. They are available through
the SVMWithSGD class, with similar parameters to linear and logistic regression. The returned SVMModel uses a threshold for prediction like LogisticRegressionModel.
Naive Bayes
Naive Bayes is a multiclass classification algorithm that scores how well each point
belongs in each class based on a linear function of the features. It is commonly used
in text classification with TF-IDF features, among other applications. MLlib implements Multinomial Naive Bayes, which expects nonnegative frequencies (e.g., word frequencies) as input features.
In MLlib, you can use Naive Bayes through the mllib.classification.NaiveBayes class. It supports one parameter, lambda (or lambda_ in Python), used for smoothing. You can call it on an RDD of LabeledPoints, where the labels are between 0 and C-1 for C classes.
The returned NaiveBayesModel lets you predict() the class in which a point best belongs, as well as access the two parameters of the trained model: theta, the matrix of class probabilities for each feature (of size C × D for C classes and D features), and pi, the C-dimensional vector of class priors.
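A plain-Python sketch of how a multinomial Naive Bayes prediction can be computed from parameters shaped like theta and pi (this illustrates the scoring rule, not NaiveBayesModel's internal code):

```python
import math

def nb_predict(x, theta, pi):
    """Multinomial Naive Bayes scoring: for each class c, compute
    log pi[c] + sum_j x[j] * log theta[c][j], and return the class
    with the highest score. theta is a C x D list of lists of
    class-conditional feature probabilities; pi is the list of C
    class priors; x is a D-dimensional nonnegative feature vector."""
    best_class, best_score = None, float("-inf")
    for c in range(len(pi)):
        score = math.log(pi[c]) + sum(
            x[j] * math.log(theta[c][j]) for j in range(len(x)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

With two classes whose feature distributions favor different words, a document dominated by class 0's words scores highest under class 0, and vice versa.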