Statistics.corr(rdd, method)
Computes the correlation matrix between columns in an RDD of vectors, using
either the Pearson or Spearman correlation (method must be either pearson or
spearman).
Statistics.corr(rdd1, rdd2, method)
Computes the correlation between two RDDs of floating-point values, using
either the Pearson or Spearman correlation (method must be either pearson or
spearman).
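To make the two method options concrete, here is a minimal plain-Python sketch of what each correlation measures on two small lists of floats. It is an illustration only, not MLlib's implementation; the helper names pearson, ranks, and spearman are hypothetical and the sample data is made up.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences of floats."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Made-up data: ys grows monotonically with xs, but not linearly.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 4.0, 9.0, 16.0, 25.0]
pearson_r = pearson(xs, ys)
spearman_r = spearman(xs, ys)
```

Because Spearman works on ranks, it reports a perfect correlation for any strictly increasing relationship, while Pearson falls below 1 when the relationship is not linear.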
Statistics.chiSqTest(rdd)
Computes Pearson's chi-squared independence test of every feature against the
label on an RDD of LabeledPoint objects. Returns an array of ChiSqTestResult
objects that capture the p-value, test statistic, and degrees of freedom for each
feature. Label and feature values must be categorical (i.e., discrete values).
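As an illustration of the quantities such a test reports, the following plain-Python sketch computes the Pearson chi-squared statistic and degrees of freedom for a single categorical feature against a label. It is a sketch of the textbook formula, not MLlib's code; the function name chi_sq_independence is hypothetical, and the p-value is left out because it needs the chi-squared distribution function, which the standard library does not provide.

```python
from collections import Counter

def chi_sq_independence(feature_values, labels):
    """Return (statistic, degrees_of_freedom) for one categorical feature."""
    n = len(labels)
    observed = Counter(zip(feature_values, labels))  # contingency table counts
    feature_totals = Counter(feature_values)         # row totals
    label_totals = Counter(labels)                   # column totals
    statistic = 0.0
    for f, f_total in feature_totals.items():
        for l, l_total in label_totals.items():
            # Count expected in this cell if feature and label were independent.
            expected = f_total * l_total / n
            statistic += (observed[(f, l)] - expected) ** 2 / expected
    dof = (len(feature_totals) - 1) * (len(label_totals) - 1)
    return statistic, dof

# Made-up data: the first feature matches the label exactly,
# the second is spread evenly across both labels.
labels = [0, 0, 1, 1]
dependent_stat, dependent_dof = chi_sq_independence([0, 0, 1, 1], labels)
independent_stat, independent_dof = chi_sq_independence([0, 1, 0, 1], labels)
```

A larger statistic for the same degrees of freedom is stronger evidence against independence, which is why the feature that tracks the label scores higher here.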
Apart from these methods, RDDs containing numeric data offer several basic
statistics such as mean(), stdev(), and sum(), as described in “Numeric RDD
Operations” on page 113. In addition, RDDs support sample() and sampleByKey()
to build simple and stratified samples of data.
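The idea behind a stratified sample can be sketched in plain Python: each key gets its own sampling fraction, and every record is kept with the probability assigned to its key. This is one common approach, shown under the assumption of sampling without replacement; the function sample_by_key_sketch and the data are hypothetical, and the sketch makes no claim about how Spark implements sampleByKey().

```python
import random

def sample_by_key_sketch(pairs, fractions, seed):
    """Keep each (key, value) pair with the probability given for its key."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    return [(k, v) for k, v in pairs if rng.random() < fractions.get(k, 0.0)]

# Made-up data: 100 records per class.
pairs = [("spam", i) for i in range(100)] + [("ham", i) for i in range(100)]

# Keep roughly half of the spam records and a tenth of the ham records.
sample = sample_by_key_sketch(pairs, {"spam": 0.5, "ham": 0.1}, seed=42)
```

With this kind of per-record sampling, the size of each stratum in the result is only approximately its fraction of the input, not an exact count.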
Classification and Regression
Classification and regression are two common forms of supervised learning, where
algorithms attempt to predict a variable from features of objects using labeled
training data (i.e., examples where we know the answer). The difference between
them is the type of variable predicted: in classification, the variable is discrete
(i.e., it takes on a finite set of values called classes); for example, classes might be
spam or nonspam for emails, or the language in which the text is written. In
regression, the variable predicted is continuous (e.g., the height of a person given
her age and weight).
Both classification and regression use MLlib's LabeledPoint class, which resides in
the mllib.regression package and is described in “Data Types” on page 218. A
LabeledPoint consists simply of a label (which is always a Double value, but can be
set to discrete integers for classification) and a features vector.
For binary classification, MLlib expects the labels 0 and 1. Some
texts use -1 and 1 instead, but passing those labels to MLlib will
lead to incorrect results. For multiclass classification, MLlib
expects labels from 0 to C-1, where C is the number of classes.
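A small plain-Python sketch shows how labels might be brought into the expected ranges before building labeled points. The helper names to_binary_label and index_classes are hypothetical and not part of MLlib; labels are produced as floats to mirror the Double label described above.

```python
def to_binary_label(label):
    """Convert a -1/1 label to the 0/1 convention, as a float."""
    if label == -1:
        return 0.0
    if label == 1:
        return 1.0
    raise ValueError("expected -1 or 1, got %r" % (label,))

def index_classes(class_names):
    """Map C distinct class names to the labels 0.0 .. C-1 (sorted order)."""
    distinct = sorted(set(class_names))
    return {name: float(i) for i, name in enumerate(distinct)}

# Binary case: data labeled with -1 and 1.
binary_labels = [to_binary_label(y) for y in [-1, 1, 1, -1]]

# Multiclass case: made-up language names as classes.
languages = ["en", "fr", "de", "fr", "en"]
mapping = index_classes(languages)
multiclass_labels = [mapping[name] for name in languages]
```

Keeping the mapping around makes it possible to translate a predicted numeric label back to its class name later.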
MLlib includes a variety of methods for classification and regression, including
simple linear methods and decision trees and forests.