representation requires more instructions per vector element than dense vectors.)
But if going to a sparse representation is the difference between being able to cache
your vectors in memory and not, you should consider a sparse representation even
for denser data.
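To make the tradeoff concrete, here is a minimal sketch of building the same vector in both forms with MLlib's Vectors factory (the dimensions and values are made up for illustration):

```scala
import org.apache.spark.mllib.linalg.Vectors

// A 100-dimensional vector with only 3 nonzero entries.
// The dense form stores all 100 doubles; the sparse form stores
// just 3 indices and 3 values, a large memory savings at this density.
val sparseVec = Vectors.sparse(100, Array(0, 10, 99), Array(1.0, 2.0, 3.0))
val denseVec  = Vectors.dense(sparseVec.toArray)
```

Even at moderate densities, the smaller sparse footprint can be what lets a cached dataset fit in memory.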
Level of Parallelism
For most algorithms, you should have at least as many partitions in your input RDD as the number of cores on your cluster to achieve full parallelism. Recall that Spark creates a partition for each "block" of a file by default, where a block is typically 64 MB. You can pass a minimum number of partitions to methods like SparkContext.textFile() to change this; for example, sc.textFile("data.txt", 10). Alternatively, you can call repartition(numPartitions) on your RDD to partition it equally into numPartitions pieces. You can always see the number of partitions in each RDD on Spark's web UI. At the same time, be careful with adding too many partitions, because this will increase the communication cost.
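The two approaches above can be sketched as follows (a minimal sketch; the file path and partition counts are placeholders, and an existing SparkContext named sc is assumed, as elsewhere in the chapter):

```scala
import org.apache.spark.SparkContext

def partitioningExample(sc: SparkContext): Unit = {
  // Ask for at least 10 partitions when reading the file.
  val lines = sc.textFile("data.txt", 10)
  println(lines.partitions.length)

  // Repartition an existing RDD into 4 equal pieces.
  // Note that repartition() triggers a shuffle across the cluster.
  val repartitioned = lines.repartition(4)
  println(repartitioned.partitions.length)
}
```

Because repartition() shuffles data over the network, it is worth doing once up front rather than repeatedly inside an iterative algorithm.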
Pipeline API
Starting in Spark 1.2, MLlib is adding a new, higher-level API for machine learning, based on the concept of pipelines. This API is similar to the pipeline API in SciKit-Learn. In short, a pipeline is a series of algorithms (either feature transformation or model fitting) that transform a dataset. Each stage of the pipeline may have parameters (e.g., the number of iterations in LogisticRegression). The pipeline API can automatically search for the best set of parameters using a grid search, evaluating each set using an evaluation metric of choice.
The pipeline API uses a uniform representation of datasets throughout, which is SchemaRDDs from Spark SQL in Chapter 9. SchemaRDDs have multiple named columns, making it easy to refer to different fields in the data. Various pipeline stages may add columns (e.g., a featurized version of the data). The overall concept is also similar to data frames in R.
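As a brief aside, here is a minimal sketch of how a SchemaRDD with named columns can be built from an RDD of case classes in Spark 1.x; the LabeledDocument class and its field names are illustrative assumptions, not part of the book's example:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// Illustrative case class; its fields become the named columns.
case class LabeledDocument(id: Long, text: String, label: Double)

def schemaRddExample(sc: SparkContext): Unit = {
  val sqlContext = new SQLContext(sc)
  // Enables the implicit conversion from an RDD of case classes
  // to a SchemaRDD (Spark 1.x API).
  import sqlContext.createSchemaRDD

  val docs = sc.parallelize(Seq(
    LabeledDocument(0L, "cheap money offer", 1.0),
    LabeledDocument(1L, "meeting at noon", 0.0)))

  // Columns are named after the case class fields: id, text, label.
  docs.registerTempTable("docs")
  sqlContext.sql("SELECT text FROM docs WHERE label = 1.0").collect()
}
```

Pipeline stages refer to these column names (e.g., a text column as input, a features column as output), which is what lets stages compose without manual plumbing.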
To give you a preview of this API, we include a version of the spam classification examples from earlier in the chapter. We also show how to augment the example to do a grid search over several values of the HashingTF and LogisticRegression parameters. (See Example 11-15.)
Example 11-15. Pipeline API version of spam classification in Scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}