Machine Learning with MLlib - Learning Spark

categoricalFeaturesInfo

A map specifying which features are categorical, and how many categories they each have. For example, if feature 1 is a binary feature with labels 0 and 1, and feature 2 is a three-valued feature with values 0, 1, and 2, you would pass {1: 2, 2: 3}. Use an empty map if no features are categorical.
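
As a minimal sketch of what such a map encodes, the plain-Python snippet below checks a feature vector against it. The helper invalid_categorical_features is hypothetical and not part of MLlib, and treating feature 0 as continuous is an assumption made only for this illustration.

```python
# Hypothetical helper, not part of MLlib: checks that a feature vector is
# consistent with a categoricalFeaturesInfo-style map.

# Feature index -> number of categories, as in the example above.
categorical_features_info = {1: 2, 2: 3}


def invalid_categorical_features(features, categorical_info):
    """Return the indices of categorical features whose value is not an
    integer category in the range 0 .. (number of categories - 1)."""
    bad = []
    for index, arity in sorted(categorical_info.items()):
        value = features[index]
        if value != int(value) or not 0 <= value < arity:
            bad.append(index)
    return bad


# Assumed layout: feature 0 is continuous, feature 1 is binary,
# feature 2 is three-valued.
print(invalid_categorical_features([0.7, 1.0, 2.0], categorical_features_info))  # []
print(invalid_categorical_features([0.7, 2.0, 1.0], categorical_features_info))  # [1]
```

An empty map makes the check trivially pass, which mirrors the case where every feature is treated as continuous.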

The online MLlib documentation contains a detailed explanation of the algorithm used. The cost of the algorithm scales linearly with the number of training examples, number of features, and maxBins. For large datasets, you may wish to lower maxBins to train a model faster, though this will also decrease quality.

The train() methods return a DecisionTreeModel. You can use it to predict values for a new feature vector or an RDD of vectors via predict(), or print the tree using toDebugString(). This object is serializable, so you can save it using Java Serialization and load it in another program.

Finally, in Spark 1.2, MLlib adds an experimental RandomForest class in Java and Scala to build ensembles of trees, also known as random forests. It is available through RandomForest.trainClassifier and trainRegressor. Apart from the per-tree parameters just listed, RandomForest takes the following parameters:

numTrees

How many trees to build. Increasing numTrees decreases the likelihood of overfitting on training data.

featureSubsetStrategy

Number of features to consider for splits at each node; can be auto (let the library select it), all, sqrt, log2, or onethird; larger values are more expensive.

seed

Random-number seed to use.
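
To give a feel for how the featureSubsetStrategy values translate into a feature count, here is a rough plain-Python sketch. The function features_per_split is hypothetical. The rounding rules and the way auto is resolved (all features for a single tree, otherwise sqrt for classification and onethird for regression) are assumptions made for illustration; check the MLlib documentation for the exact behavior in your Spark version.

```python
import math


def features_per_split(strategy, num_features, num_trees=1, classification=True):
    """Number of candidate features per node for a featureSubsetStrategy value.

    The rounding and the resolution of "auto" are assumptions, not a
    statement of what MLlib does internally.
    """
    if strategy == "auto":
        if num_trees == 1:
            strategy = "all"
        else:
            strategy = "sqrt" if classification else "onethird"
    if strategy == "all":
        return num_features
    if strategy == "sqrt":
        return math.ceil(math.sqrt(num_features))
    if strategy == "log2":
        return max(1, math.ceil(math.log2(num_features)))
    if strategy == "onethird":
        return math.ceil(num_features / 3.0)
    raise ValueError("unknown featureSubsetStrategy: " + strategy)


for name in ("all", "sqrt", "log2", "onethird"):
    print(name, features_per_split(name, 100))
# all 100
# sqrt 10
# log2 7
# onethird 34
```

With 100 features, all examines ten times as many candidate features per node as sqrt, which is why larger settings cost more to train.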

Random forests return a WeightedEnsembleModel that contains several trees (in the weakHypotheses field, weighted by weakHypothesisWeights) and can predict() an RDD or Vector. It also includes a toDebugString to print all the trees.
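
The idea of combining weighted trees can be sketched in plain Python. This is not MLlib's implementation: weighted_vote and weighted_average are hypothetical helpers, and the assumption here is that classification uses a weighted vote while regression uses a weighted mean of the per-tree predictions.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight among per-tree predictions."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


def weighted_average(predictions, weights):
    """Return the weighted mean of per-tree predictions."""
    return sum(p * w for p, w in zip(predictions, weights)) / sum(weights)


# Three trees with equal weights (an assumption for this example).
print(weighted_vote([1.0, 0.0, 1.0], [1.0, 1.0, 1.0]))     # 1.0
print(weighted_average([2.0, 4.0, 6.0], [1.0, 1.0, 1.0]))  # 4.0
```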

Clustering

Clustering is the unsupervised learning task that involves grouping objects into clusters of high similarity. Unlike the supervised tasks seen before, where data is labeled, clustering can be used to make sense of unlabeled data. It is commonly used in data exploration (to find what a new dataset looks like) and in anomaly detection (to identify points that are far from any cluster).
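
The anomaly-detection use can be illustrated with a small plain-Python sketch that flags points lying far from every cluster center. The centers, points, and threshold are made-up values, and find_anomalies is a hypothetical helper. In practice the centers would come from a clustering algorithm.

```python
import math


def nearest_center_distance(point, centers):
    """Euclidean distance from a point to the closest cluster center."""
    return min(
        math.sqrt(sum((p - c) ** 2 for p, c in zip(point, center)))
        for center in centers
    )


def find_anomalies(points, centers, threshold):
    """Return the points that lie farther than threshold from every center."""
    return [p for p in points if nearest_center_distance(p, centers) > threshold]


# Hypothetical cluster centers, e.g. produced by an earlier clustering run.
centers = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.5, -0.5), (9.0, 11.0), (5.0, 5.0), (30.0, 0.0)]

print(find_anomalies(points, centers, threshold=3.0))
# [(5.0, 5.0), (30.0, 0.0)]
```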
