The LogisticRegressionModel from these algorithms computes a score between 0 and 1 for each point, as returned by the logistic function. It then returns either 0 or 1 based on a threshold that can be set by the user: by default, if the score is at least 0.5, it will return 1. You can change this threshold via setThreshold(). You can also disable it altogether via clearThreshold(), in which case predict() will return the raw scores. For balanced datasets with about the same number of positive and negative examples, we recommend leaving the threshold at 0.5. For imbalanced datasets, you can increase the threshold to drive down the number of false positives (i.e., increase precision but decrease recall), or you can decrease the threshold to drive down the number of false negatives.
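The threshold behavior can be sketched in plain Python (a minimal illustration of the logic described above, not MLlib's actual implementation; the function names here are hypothetical):

```python
import math

def logistic(z):
    """Logistic function: maps a raw linear score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(raw_score, threshold=0.5):
    """Mimic the described behavior: return a 0/1 label based on the
    threshold, or the raw logistic score when the threshold is cleared
    (represented here by threshold=None)."""
    score = logistic(raw_score)
    if threshold is None:  # analogous to clearThreshold()
        return score
    return 1 if score >= threshold else 0
```

For example, a raw score of 2.0 gives a logistic score of about 0.88, so `predict(2.0)` returns 1 at the default threshold but 0 if the threshold is raised to 0.9.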
When using logistic regression, it is usually important to scale the
features in advance to be in the same range. You can use MLlib's
StandardScaler to do this, as seen in “Scaling” on page 222 .
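To make the scaling concrete, here is a plain-Python sketch of the kind of normalization StandardScaler can perform when configured to center and scale features (zero mean, unit standard deviation per column); this is an illustration of the idea, not MLlib code:

```python
def standardize(rows):
    """Scale each feature column to zero mean and unit standard
    deviation. rows is a list of equal-length feature lists."""
    n = len(rows)
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = []
    for j in range(d):
        var = sum((r[j] - means[j]) ** 2 for r in rows) / n
        stds.append(var ** 0.5 or 1.0)  # guard against zero variance
    return [[(r[j] - means[j]) / stds[j] for j in range(d)]
            for r in rows]
```

After scaling, features measured on very different ranges (say, word counts versus document lengths) contribute comparably to the learned weights.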
Support Vector Machines
Support Vector Machines, or SVMs, are another binary classification method with
linear separating planes, again expecting labels of 0 or 1. They are available through
the SVMWithSGD class, with similar parameters to linear and logistic regression. The returned SVMModel uses a threshold for prediction like LogisticRegressionModel.
Naive Bayes
Naive Bayes is a multiclass classification algorithm that scores how well each point
belongs in each class based on a linear function of the features. It is commonly used
in text classification with TF-IDF features, among other applications. MLlib implements Multinomial Naive Bayes, which expects nonnegative frequencies (e.g., word frequencies) as input features.
In MLlib, you can use Naive Bayes through the mllib.classification.NaiveBayes class. It supports one parameter, lambda (or lambda_ in Python), used for smoothing. You can call it on an RDD of LabeledPoints, where the labels are between 0 and C-1 for C classes.
The returned NaiveBayesModel lets you predict() the class in which a point best belongs, as well as access the two parameters of the trained model: theta, the matrix of class probabilities for each feature (of size C × D for C classes and D features), and pi, the C-dimensional vector of class priors.
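A plain-Python sketch of how a multinomial Naive Bayes prediction can be computed from parameters shaped like theta and pi (this illustrates the scoring rule, not NaiveBayesModel's internal code):

```python
import math

def nb_predict(x, theta, pi):
    """Multinomial Naive Bayes scoring: for each class c, compute
    log pi[c] + sum_j x[j] * log theta[c][j], and return the class
    with the highest score. theta is a C x D list of lists of
    class-conditional feature probabilities; pi is the list of C
    class priors; x is a D-dimensional nonnegative feature vector."""
    best_class, best_score = None, float("-inf")
    for c in range(len(pi)):
        score = math.log(pi[c]) + sum(
            x[j] * math.log(theta[c][j]) for j in range(len(x)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

With two classes whose feature distributions favor different words, a document dominated by class 0's words scores highest under class 0, and vice versa.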