Statistics.corr(rdd, method)
Computes the correlation matrix between columns in an RDD of vectors, using either the Pearson or Spearman correlation (method must be one of pearson and spearman).
Statistics.corr(rdd1, rdd2, method)
Computes the correlation between two RDDs of floating-point values, using either the Pearson or Spearman correlation (method must be one of pearson and spearman).
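To make the difference between the two methods concrete, here is a minimal plain-Python sketch of Pearson and Spearman correlation for two sequences of floats. It runs on ordinary lists rather than RDDs and is not MLlib's implementation; the helper names pearson, average_ranks, and spearman are made up for this illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences of floats."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 4.0, 9.0, 16.0, 25.0]  # monotonic in xs, but not linear
print("pearson: ", pearson(xs, ys))
print("spearman:", spearman(xs, ys))
```

Because ys increases whenever xs does, the rank-based Spearman value is 1, while the Pearson value is slightly lower because the relationship is not a straight line.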
Statistics.chiSqTest(rdd)
Computes Pearson's independence test for every feature with the label on an RDD of LabeledPoint objects. Returns an array of ChiSqTestResult objects that capture the p-value, test statistic, and degrees of freedom for each feature. Label and feature values must be categorical (i.e., discrete values).
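As a rough illustration of what the test statistic and degrees of freedom measure for a single feature, the following plain-Python sketch builds a contingency table from (feature value, label) pairs and computes Pearson's chi-squared statistic. It is not MLlib's code, the function name chi_squared_independence is hypothetical, and the p-value is left out because it needs the chi-squared distribution, which the standard library does not provide.

```python
from collections import Counter


def chi_squared_independence(pairs):
    """Return (statistic, degrees_of_freedom) for (feature_value, label) pairs."""
    pairs = list(pairs)
    n = len(pairs)
    observed = Counter(pairs)                       # counts per (feature, label) cell
    feature_totals = Counter(f for f, _ in pairs)   # row totals
    label_totals = Counter(l for _, l in pairs)     # column totals

    statistic = 0.0
    for f, f_total in feature_totals.items():
        for l, l_total in label_totals.items():
            # Expected count in this cell if feature and label were independent.
            expected = f_total * l_total / n
            statistic += (observed[(f, l)] - expected) ** 2 / expected

    dof = (len(feature_totals) - 1) * (len(label_totals) - 1)
    return statistic, dof


# Made-up data: feature value 0 mostly occurs with label 0, and 1 with label 1.
data = [(0, 0)] * 30 + [(0, 1)] * 10 + [(1, 0)] * 10 + [(1, 1)] * 30
statistic, dof = chi_squared_independence(data)
print(statistic, dof)
```

A statistic that is large relative to the degrees of freedom suggests that the feature and the label are not independent.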
Apart from these methods, RDDs containing numeric data offer several basic statistics such as mean(), stdev(), and sum(), as described in “Numeric RDD Operations” on page 113. In addition, RDDs support sample() and sampleByKey() to build simple and stratified samples of data.
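The sketch below mimics these operations on ordinary Python lists to show what they compute. It assumes that the standard deviation is the population one, and the sample_by_key helper is a hypothetical stand-in that keeps each record with a per-key probability; neither is taken from Spark's implementation.

```python
import random
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
total = sum(values)
mean = statistics.mean(values)
stdev = statistics.pstdev(values)  # assumption: population standard deviation
print(total, mean, stdev)


def sample_by_key(pairs, fractions, seed):
    """Stratified sample without replacement: keep each (key, value) pair
    with the probability given for its key in `fractions`."""
    rng = random.Random(seed)
    return [(k, v) for k, v in pairs if rng.random() < fractions[k]]


# Made-up keyed data: 6 records for key "a" and 4 for key "b".
pairs = [("a", i) for i in range(6)] + [("b", i) for i in range(4)]

# Keep roughly half of the "a" records and all of the "b" records.
sampled = sample_by_key(pairs, {"a": 0.5, "b": 1.0}, seed=42)
print(sampled)
```

Setting a different fraction per key is what makes the sample stratified: a rare key can be kept in full while a common one is thinned out.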
Classification and Regression
Classification and regression are two common forms of supervised learning, where algorithms attempt to predict a variable from features of objects using labeled training data (i.e., examples where we know the answer). The difference between them is the type of variable predicted: in classification, the variable is discrete (i.e., it takes on a finite set of values called classes); for example, classes might be spam or nonspam for emails, or the language in which the text is written. In regression, the variable predicted is continuous (e.g., the height of a person given her age and weight).
Both classification and regression use the LabeledPoint class in MLlib, described in “Data Types” on page 218, which resides in the mllib.regression package. A LabeledPoint consists simply of a label (which is always a Double value, but can be set to discrete integers for classification) and a features vector.
For binary classification, MLlib expects the labels 0 and 1. Some texts use -1 and 1 instead, but passing those labels to MLlib will lead to incorrect results. For multiclass classification, MLlib expects labels from 0 to C - 1, where C is the number of classes.
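If the data at hand uses a different labeling scheme, the labels can be remapped before training. The following plain-Python sketch shows one way to do that; Point is a hypothetical stand-in for a labeled example (not MLlib's LabeledPoint), and to_zero_one and index_classes are illustrative helper names.

```python
from collections import namedtuple

# Hypothetical stand-in for a labeled example: a float label plus a feature list.
Point = namedtuple("Point", ["label", "features"])


def to_zero_one(label):
    """Map a -1/1 binary label to the 0/1 convention."""
    if label not in (-1, 1):
        raise ValueError("expected a label of -1 or 1, got %r" % (label,))
    return 1.0 if label == 1 else 0.0


def index_classes(class_names):
    """Assign each distinct class name a label in 0 .. C-1, in sorted order."""
    return {name: float(i) for i, name in enumerate(sorted(set(class_names)))}


# Binary case: made-up examples labeled -1/1 are converted to 0/1.
raw = [(-1, [0.0, 1.5]), (1, [2.0, 0.3])]
points = [Point(to_zero_one(label), features) for label, features in raw]
print(points)

# Multiclass case: class names become 0.0, 1.0, ..., C-1.
mapping = index_classes(["spanish", "english", "french", "english"])
print(mapping)
```

Keeping the mapping around makes it possible to translate predicted numeric labels back into class names later.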
MLlib includes a variety of methods for classification and regression, including simple linear methods and decision trees and forests.