categoricalFeaturesInfo
A map specifying which features are categorical, and how many categories they each have. For example, if feature 1 is a binary feature with labels 0 and 1, and feature 2 is a three-valued feature with values 0, 1, and 2, you would pass {1: 2, 2: 3}. Use an empty map if no features are categorical.
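In the Python API this map is an ordinary dict from feature index to category count; a minimal sketch of the example above (plain Python, no Spark required):

```python
# Map from feature index to number of categories, matching the example
# in the text: feature 1 is binary, feature 2 is three-valued.
categorical_features_info = {1: 2, 2: 3}

# Any feature index absent from the map is treated as continuous.
def is_categorical(feature_index):
    return feature_index in categorical_features_info
```

Passing an empty dict tells the trainer that every feature is continuous.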
The online MLlib documentation contains a detailed explanation of the algorithm
used. The cost of the algorithm scales linearly with the number of training examples,
number of features, and maxBins . For large datasets, you may wish to lower maxBins
to train a model faster, though this will also decrease quality.
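Because the stated cost is linear in each factor, you can compare candidate settings with a rough back-of-the-envelope product (illustrative only, not a formula from MLlib):

```python
def relative_training_cost(num_examples, num_features, max_bins):
    """Rough relative cost: the text says the algorithm scales
    linearly with each of these three quantities."""
    return num_examples * num_features * max_bins

# Halving maxBins roughly halves the estimated training cost.
base = relative_training_cost(1_000_000, 100, 32)
cheaper = relative_training_cost(1_000_000, 100, 16)
```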
The train() methods return a DecisionTreeModel. You can use it to predict values for a new feature vector or an RDD of vectors via predict(), or print the tree using toDebugString(). This object is serializable, so you can save it using Java Serialization and load it in another program.
Finally, in Spark 1.2, MLlib adds an experimental RandomForest class in Java and
Scala to build ensembles of trees, also known as random forests. It is available
through RandomForest.trainClassifier and trainRegressor . Apart from the per-
tree parameters just listed, RandomForest takes the following parameters:
numTrees
How many trees to build. Increasing numTrees decreases the likelihood of overfitting on training data.
featureSubsetStrategy
Number of features to consider for splits at each node; can be auto (let the
library select it), all , sqrt , log2 , or onethird ; larger values are more expensive.
seed
Random-number seed to use.
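As a rough illustration of how the named featureSubsetStrategy values trade cost for tree diversity, here is what they evaluate to for a 64-feature dataset (plain Python; the round-up behavior is an assumption made here for illustration — consult the MLlib source for the exact rounding):

```python
import math

def features_per_node(strategy, num_features):
    """Illustrative feature counts per strategy (rounding up assumed)."""
    if strategy == "all":
        return num_features
    if strategy == "sqrt":
        return math.ceil(math.sqrt(num_features))
    if strategy == "log2":
        return max(1, math.ceil(math.log2(num_features)))
    if strategy == "onethird":
        return math.ceil(num_features / 3)
    raise ValueError("unknown strategy: %s" % strategy)

for s in ("all", "sqrt", "log2", "onethird"):
    print(s, features_per_node(s, 64))
```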
Random forests return a WeightedEnsembleModel that contains several trees (in the weakHypotheses field, weighted by weakHypothesisWeights) and can predict() an RDD or Vector. It also provides a toDebugString() method to print all the trees.
Clustering
Clustering is the unsupervised learning task of grouping objects into clusters of high similarity. Unlike the supervised tasks seen before, where data is labeled, clustering can be used to make sense of unlabeled data. It is commonly used in data exploration (to see what a new dataset looks like) and in anomaly detection (to identify points that are far from any cluster).
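The anomaly-detection use can be sketched in a few lines: score each point by its distance to the nearest cluster center and flag the points with the largest scores (the centers below are hypothetical stand-ins for the output of a clustering algorithm):

```python
import math

# Hypothetical cluster centers, e.g. found by a clustering algorithm.
centers = [(0.0, 0.0), (10.0, 10.0)]

def anomaly_score(point):
    """Distance from the point to its nearest cluster center."""
    return min(math.dist(point, center) for center in centers)

# A point near a cluster scores low; a point far from every cluster
# scores high and is a candidate anomaly.
near = anomaly_score((0.5, 0.2))
far = anomaly_score((50.0, 50.0))
```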