Machine Learning with MLlib - Learning Spark


Statistics.corr(rdd, method)

Computes the correlation matrix between columns in an RDD of vectors, using either the Pearson or Spearman correlation (method must be either pearson or spearman).

Statistics.corr(rdd1, rdd2, method)

Computes the correlation between two RDDs of floating-point values, using either the Pearson or Spearman correlation (method must be either pearson or spearman).

Statistics.chiSqTest(rdd)

Computes Pearson's chi-squared independence test for every feature against the label on an RDD of LabeledPoint objects. Returns an array of ChiSqTestResult objects that capture the p-value, test statistic, and degrees of freedom for each feature. Label and feature values must be categorical (i.e., discrete values).
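To make the quantities named above concrete, here is a minimal plain-Python sketch of what the Pearson correlation, the Spearman (rank) correlation, and the chi-squared independence statistic measure. It does not use MLlib and is not how Spark implements them. The helper names (pearson, ranks, spearman, chi_sq_independence) and the sample data are assumptions made for illustration, and the p-value is left out because it needs the chi-squared distribution.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation: co-variation scaled by the spread of each series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            result[order[pos]] = shared
        start = end + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))


def chi_sq_independence(pairs):
    """Chi-squared statistic and degrees of freedom for (feature, label) pairs."""
    pairs = list(pairs)
    n = len(pairs)
    observed = Counter(pairs)
    feature_totals = Counter(feature for feature, _ in pairs)
    label_totals = Counter(label for _, label in pairs)
    statistic = 0.0
    for feature, feature_count in feature_totals.items():
        for label, label_count in label_totals.items():
            # Count expected in this cell if feature and label were independent.
            expected = feature_count * label_count / n
            statistic += (observed.get((feature, label), 0) - expected) ** 2 / expected
    dof = (len(feature_totals) - 1) * (len(label_totals) - 1)
    return statistic, dof


# A monotonic but nonlinear relationship: y = x ** 3.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [x ** 3 for x in xs]
pearson_r = pearson(xs, ys)    # below 1, because the relationship is not linear
spearman_r = spearman(xs, ys)  # 1 up to rounding, because the ordering matches exactly

# Illustrative (feature, label) counts in which the feature tracks the label.
pairs = [(0, 0)] * 30 + [(1, 0)] * 10 + [(0, 1)] * 10 + [(1, 1)] * 30
statistic, dof = chi_sq_independence(pairs)
```

Spearman works on ranks, so it rewards any monotonic relationship, while Pearson only reaches 1 for a linear one. That difference is the practical reason to choose one method value over the other.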

Apart from these methods, RDDs containing numeric data offer several basic statistics such as mean(), stdev(), and sum(), as described in “Numeric RDD Operations” on page 113. In addition, RDDs support sample() and sampleByKey() to build simple and stratified samples of data.
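The idea behind a stratified sample can be sketched in plain Python: each key gets its own sampling fraction, so a rare group can be kept in full while a common one is thinned out. This is only an illustration of the concept, not Spark's sampleByKey() implementation. The function name stratified_sample, the keys, and the fractions are assumptions.

```python
import random


def stratified_sample(pairs, fractions, seed):
    """Keep each (key, value) pair with the probability assigned to its key.

    Keys missing from `fractions` are dropped (probability 0.0).
    """
    rng = random.Random(seed)  # seeded so the sample is reproducible
    return [(key, value) for key, value in pairs
            if rng.random() < fractions.get(key, 0.0)]


# Hypothetical dataset: 100 "spam" records and 1000 "ham" records.
data = [("spam", i) for i in range(100)] + [("ham", i) for i in range(1000)]

# Keep every spam record but only about 10% of the ham records.
fractions = {"spam": 1.0, "ham": 0.1}
sample = stratified_sample(data, fractions, seed=42)

spam_kept = sum(1 for key, _ in sample if key == "spam")
ham_kept = sum(1 for key, _ in sample if key == "ham")
```

A simple sample would instead apply one fraction to every record, which is the special case where all keys share the same value in fractions.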

Classification and Regression

Classification and regression are two common forms of supervised learning, where algorithms attempt to predict a variable from features of objects using labeled training data (i.e., examples where we know the answer). The difference between them is the type of variable predicted: in classification, the variable is discrete (i.e., it takes on a finite set of values called classes); for example, classes might be spam or nonspam for emails, or the language in which the text is written. In regression, the variable predicted is continuous (e.g., the height of a person given her age and weight).

Both classification and regression use the LabeledPoint class in MLlib, described in “Data Types” on page 218, which resides in the mllib.regression package. A LabeledPoint consists simply of a label (which is always a Double value, but can be set to discrete integers for classification) and a features vector.

For binary classification, MLlib expects the labels 0 and 1. Some texts use -1 and 1 instead, but passing those labels to MLlib will lead to incorrect results. For multiclass classification, MLlib expects labels from 0 to C - 1, where C is the number of classes.
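A small plain-Python sketch of how labels could be brought into this convention before building training data. The function names remap_signed_labels and check_multiclass_labels are hypothetical helpers, not part of MLlib. They return floats on the assumption that the values will be used as Double labels.

```python
def remap_signed_labels(labels):
    """Convert the -1/1 convention to 0/1; reject any other value."""
    mapping = {-1: 0.0, 1: 1.0}
    try:
        return [mapping[label] for label in labels]
    except KeyError as err:
        raise ValueError(f"expected -1 or 1, got {err.args[0]!r}") from None


def check_multiclass_labels(labels, num_classes):
    """Ensure every label is one of 0, 1, ..., num_classes - 1."""
    allowed = set(range(num_classes))
    bad = sorted({label for label in labels if label not in allowed})
    if bad:
        raise ValueError(f"labels outside 0..{num_classes - 1}: {bad}")
    return [float(label) for label in labels]


binary = remap_signed_labels([-1, 1, 1, -1])
multiclass = check_multiclass_labels([0, 2, 1], num_classes=3)
```

Failing loudly on an unexpected label is a deliberate choice here: a silently mislabeled training set is much harder to diagnose than an early error.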

MLlib includes a variety of methods for classification and regression, including simple linear methods and decision trees and forests.
