representation requires more instructions per vector element than dense vectors.)
But if going to a sparse representation is the difference between being able to cache
your vectors in memory and not, you should consider a sparse representation even
for denser data.
Level of Parallelism
For most algorithms, you should have at least as many partitions in your input RDD
as the number of cores on your cluster to achieve full parallelism. Recall that Spark
creates a partition for each “block” of a file by default, where a block is typically 64
MB. You can pass a minimum number of partitions to methods like
SparkContext.textFile() to change this; for example, sc.textFile("data.txt", 10).
Alternatively, you can call repartition(numPartitions) on your RDD to partition it
equally into numPartitions pieces. You can always see the number of partitions in
each RDD on Spark's web UI. At the same time, be careful with adding too many
partitions, because this will increase the communication cost.
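To make this concrete, here is a minimal sketch of both approaches. It assumes an
existing SparkContext named sc and an input file data.txt; the partition counts 10
and 32 are illustrative.

// Ask Spark for at least 10 partitions when reading the file; the second
// argument to textFile() is a minimum, not an exact count.
val lines = sc.textFile("data.txt", 10)
println(lines.partitions.length)          // check how many partitions were created

// Repartition into exactly 32 pieces; note that this triggers a shuffle.
val repartitioned = lines.repartition(32)
println(repartitioned.partitions.length)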
Pipeline API
Starting in Spark 1.2, MLlib is adding a new, higher-level API for machine learning
based on the concept of pipelines. This API is similar to the pipeline API in
scikit-learn. In short, a pipeline is a series of algorithms (either feature
transformation or model fitting) that transform a dataset. Each stage of the pipeline
may have parameters (e.g., the number of iterations in LogisticRegression). The
pipeline API can automatically search for the best set of parameters using a grid
search, evaluating each set using an evaluation metric of choice.
The pipeline API uses a uniform representation of datasets throughout, namely the
SchemaRDDs from Spark SQL covered in Chapter 9. SchemaRDDs have multiple named
columns, making it easy to refer to different fields in the data. Various pipeline
stages may add columns (e.g., a featurized version of the data). The overall concept
is also similar to data frames in R.
To give you a preview of this API, we include a version of the spam classification
example from earlier in the chapter. We also show how to augment the example to
do a grid search over several values of the HashingTF and LogisticRegression
parameters. (See Example 11-15.)
Example 11-15. Pipeline API version of spam classification in Scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
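The listing above breaks off after the imports. What follows is a minimal sketch of
how the rest of the example plausibly fits together with this API; it is not the
book's verbatim listing. It assumes an existing SparkContext named sc and a training
dataset named training that is already a SchemaRDD (a DataFrame in later Spark
releases) with "text" and "label" columns; the column names, feature counts, and
grid values are illustrative.

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

val sqlContext = new SQLContext(sc)   // used to build SchemaRDDs from raw data (not shown)

// Each pipeline stage reads one column of the dataset and appends a new one.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(10000)                  // illustrative value
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()         // reads "features" and "label" by default
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))

// Grid search over HashingTF and LogisticRegression parameters, with each
// combination evaluated by cross-validation on a binary classification metric.
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(1000, 10000))   // illustrative values
  .addGrid(lr.regParam, Array(0.01, 0.1))               // illustrative values
  .build()
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEstimatorParamMaps(paramGrid)
  .setEvaluator(new BinaryClassificationEvaluator())

val bestModel = cv.fit(training)   // fits the pipeline for each parameter
                                   // combination and keeps the best model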