categoricalFeaturesInfo
A map specifying which features are categorical, and how many categories they
each have. For example, if feature 1 is a binary feature with labels 0 and 1, and
feature 2 is a three-valued feature with values 0, 1, and 2, you would pass
{1: 2, 2: 3}. Use an empty map if no features are categorical.
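As a rough illustration, the map from the example can be written as a plain Python dict and sanity-checked against a few rows before training. The helper name check_categorical_features is hypothetical and is not part of MLlib; the sketch assumes that feature indices are 0-based and that the categories of a k-valued feature are coded as 0, 1, ..., k-1, as in the example above.

```python
# Feature 1 is binary (values 0, 1); feature 2 is three-valued (values 0, 1, 2).
categorical_features_info = {1: 2, 2: 3}


def check_categorical_features(rows, categorical_features_info):
    """Return (row_index, feature_index, value) for every categorical value
    that is not an integer code in the range 0 .. num_categories - 1.

    Hypothetical helper for illustration only; it is not an MLlib function.
    """
    problems = []
    for row_index, features in enumerate(rows):
        for feature_index, num_categories in categorical_features_info.items():
            value = features[feature_index]
            if value != int(value) or not 0 <= value < num_categories:
                problems.append((row_index, feature_index, value))
    return problems


rows = [
    [0.5, 1.0, 2.0],  # valid: feature 1 is 1, feature 2 is 2
    [1.7, 0.0, 3.0],  # invalid: feature 2 only has categories 0, 1, 2
]
print(check_categorical_features(rows, categorical_features_info))
```

Feature 0 is absent from the map, so it is treated as continuous and never checked.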
The online MLlib documentation contains a detailed explanation of the algorithm
used. The cost of the algorithm scales linearly with the number of training examples,
number of features, and maxBins. For large datasets, you may wish to lower maxBins
to train a model faster, though this will also decrease quality.
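To make the scaling concrete, here is a back-of-the-envelope sketch. It assumes that cost is simply proportional to the product of the three factors, which is one reading of "scales linearly" and not a measured cost model; the function name and the numbers are illustrative only.

```python
def relative_training_cost(num_examples, num_features, max_bins):
    """Unitless cost proxy: assumes cost is proportional to the product
    of training examples, features, and maxBins (an assumption)."""
    return num_examples * num_features * max_bins


baseline = relative_training_cost(1000000, 50, 32)
fewer_bins = relative_training_cost(1000000, 50, 16)

# Under this assumption, halving maxBins halves the estimated cost.
print(fewer_bins / baseline)
```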
The train() methods return a DecisionTreeModel. You can use it to predict values
for a new feature vector or an RDD of vectors via predict(), or print the tree using
toDebugString(). This object is serializable, so you can save it using Java
Serialization and load it in another program.
Finally, in Spark 1.2, MLlib adds an experimental RandomForest class in Java and
Scala to build ensembles of trees, also known as random forests. It is available
through RandomForest.trainClassifier and trainRegressor. Apart from the per-tree
parameters just listed, RandomForest takes the following parameters:
numTrees
How many trees to build. Increasing numTrees decreases the likelihood of
overfitting on training data.
featureSubsetStrategy
Number of features to consider for splits at each node; can be auto (let the
library select it), all, sqrt, log2, or onethird; larger values are more expensive.
seed
Random-number seed to use.
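The featureSubsetStrategy names can be read as rules that turn the total number of features into a per-node count. The sketch below is plain Python, not MLlib code. The function name is hypothetical, and both the rounding and the way auto is resolved are assumptions made for illustration; check the MLlib documentation for the exact behavior.

```python
import math


def features_per_node(strategy, num_features, num_trees=1, classification=True):
    """Estimate how many features are considered at each node.

    Hypothetical helper; rounding and the handling of "auto" are assumptions.
    """
    if strategy == "auto":
        # Assumed choice: a single tree uses every feature; a forest uses
        # sqrt for classification and onethird for regression.
        if num_trees == 1:
            strategy = "all"
        else:
            strategy = "sqrt" if classification else "onethird"
    if strategy == "all":
        return num_features
    if strategy == "sqrt":
        return math.ceil(math.sqrt(num_features))
    if strategy == "log2":
        return max(1, math.ceil(math.log2(num_features)))
    if strategy == "onethird":
        return math.ceil(num_features / 3.0)
    raise ValueError("unknown featureSubsetStrategy: %r" % strategy)


for name in ("all", "sqrt", "log2", "onethird"):
    print(name, features_per_node(name, 100))
```

With 100 features, the smaller subsets (sqrt, log2) examine far fewer candidate features per node than all, which is why larger values are more expensive.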
Random forests return a WeightedEnsembleModel that contains several trees (in the
weakHypotheses field, weighted by weakHypothesisWeights) and can predict() an
RDD or Vector. It also includes a toDebugString to print all the trees.
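The text does not say how the per-tree weights are combined at prediction time. As a minimal sketch, assuming a weighted average for regression and a weighted vote for classification, the idea can be expressed in plain Python. The function names are hypothetical, and the inputs stand in for the individual trees' predictions and their weights.

```python
def weighted_average_prediction(tree_predictions, weights):
    """Regression: combine per-tree predictions as a weighted average."""
    total_weight = sum(weights)
    weighted_sum = sum(p * w for p, w in zip(tree_predictions, weights))
    return weighted_sum / total_weight


def weighted_vote_prediction(tree_predictions, weights):
    """Classification: return the label with the largest total weight."""
    votes = {}
    for label, weight in zip(tree_predictions, weights):
        votes[label] = votes.get(label, 0.0) + weight
    return max(votes, key=votes.get)


# Three equally weighted trees.
print(weighted_average_prediction([2.0, 4.0, 6.0], [1.0, 1.0, 1.0]))
print(weighted_vote_prediction([1.0, 0.0, 1.0], [1.0, 1.0, 1.0]))
```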
Clustering
Clustering is the unsupervised learning task that involves grouping objects into
clusters of high similarity. Unlike the supervised tasks seen before, where data is
labeled, clustering can be used to make sense of unlabeled data. It is commonly used
in data exploration (to find what a new dataset looks like) and in anomaly detection
(to identify points that are far from any cluster).
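The anomaly detection use can be sketched in a few lines of plain Python. This assumes the cluster centers have already been found by some clustering algorithm and that "far" means a Euclidean distance above a threshold chosen by hand; the function names, the centers, and the threshold are all illustrative.

```python
import math


def distance_to_nearest_center(point, centers):
    """Euclidean distance from a point to its closest cluster center."""
    return min(math.dist(point, center) for center in centers)


def find_anomalies(points, centers, threshold):
    """Return the points whose nearest center is farther than threshold."""
    return [p for p in points
            if distance_to_nearest_center(p, centers) > threshold]


centers = [(0.0, 0.0), (10.0, 10.0)]             # assumed, precomputed centers
points = [(0.5, -0.5), (9.0, 11.0), (5.0, 5.0)]  # the last lies between clusters

print(find_anomalies(points, centers, threshold=3.0))
```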