Resilient Distributed Datasets
RDDs are at the heart of every Spark program, so in this section we look at how to work
with them in more detail.
Creation
There are three ways of creating RDDs: from an in-memory collection of objects (known
as parallelizing a collection), using a dataset from external storage (such as HDFS), or
transforming an existing RDD. The first way is useful for doing CPU-intensive computations on small amounts of input data in parallel. For example, the following runs separate computations on the numbers from 1 to 10:
val params = sc.parallelize(1 to 10)
val result = params.map(performExpensiveComputation)
The performExpensiveComputation function is run on input values in parallel.
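performExpensiveComputation is not defined here, so the following self-contained sketch substitutes a hypothetical CPU-heavy stand-in; sc is assumed to be the SparkContext created by the Spark shell.

// Hypothetical stand-in for performExpensiveComputation; any pure,
// CPU-heavy function of an Int would serve the same purpose.
def performExpensiveComputation(n: Int): Long =
  (1 to 1000000).foldLeft(0L)((acc, i) => acc + (n.toLong * i) % 7)

val params = sc.parallelize(1 to 10)
val result = params.map(performExpensiveComputation)

// collect() brings the ten results back to the driver for inspection.
println(result.collect().mkString(", "))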
The level of parallelism is determined from the spark.default.parallelism property, which has a default value that depends on where the Spark job is running. When running locally it is the number of cores on the machine, while for a cluster it is the total number of cores on all executor nodes in the cluster.
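The effective value can be inspected from the driver; as a brief sketch (again assuming sc is an existing SparkContext, such as the one the Spark shell provides):

// Number of partitions parallelize() uses when no explicit count is given;
// reflects spark.default.parallelism if that property has been set.
println(sc.defaultParallelism)

// The same value observed from an RDD created without a second argument.
println(sc.parallelize(1 to 10).partitions.length)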
You can also override the level of parallelism for a particular computation by passing it as
the second argument to parallelize() :
sc.parallelize(1 to 10, 10)
The second way to create an RDD is by creating a reference to an external dataset. We have
already seen how to create an RDD of String objects for a text file:
val text: RDD[String] = sc.textFile(inputPath)
The path may be any Hadoop filesystem path, such as a file on the local filesystem or on
HDFS. Internally, Spark uses TextInputFormat from the old MapReduce API to read
the file (see TextInputFormat). This means that the file-splitting behavior is the same as in Hadoop itself, so in the case of HDFS there is one Spark partition per HDFS block. The default can be changed by passing a second argument to request a particular number of splits:
sc.textFile(inputPath, 10)
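One way to see the effect is to compare partition counts, as in this sketch (inputPath is assumed to point at an existing file or directory):

// Default splitting: for HDFS input, roughly one partition per block.
println(sc.textFile(inputPath).partitions.length)

// Requesting at least 10 splits for the same input (the second argument
// is a minimum, so more partitions may be created).
println(sc.textFile(inputPath, 10).partitions.length)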
Another variant permits text files to be processed as whole files (similar to Processing a whole file as a record) by returning an RDD of string pairs, where the first string is the file path and the second is the file contents. Since each file is loaded into memory, this is only suitable for small files:
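A minimal sketch of this variant, assuming the method being described is SparkContext's wholeTextFiles() and that inputPath names a directory of small files:

import org.apache.spark.rdd.RDD

// Each element is a (file path, file contents) pair; every file is read
// fully into memory, which is why this only suits small files.
val files: RDD[(String, String)] = sc.wholeTextFiles(inputPath)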