Resilient Distributed Datasets
RDDs are at the heart of every Spark program, so in this section we look at how to work
with them in more detail.
Creation
There are three ways of creating RDDs: from an in-memory collection of objects (known as parallelizing a collection), using a dataset from external storage (such as HDFS), or transforming an existing RDD. The first way is useful for doing CPU-intensive computations on small amounts of input data in parallel. For example, the following runs separate computations on the numbers from 1 to 10:
val params = sc.parallelize(1 to 10)
val result = params.map(performExpensiveComputation)
The performExpensiveComputation function is run on input values in parallel.
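The text does not define performExpensiveComputation; a hypothetical stand-in that makes the example self-contained might look like this:

```scala
// Hypothetical stand-in for the expensive per-element work (not from the text):
// a deliberately heavy calculation whose cost dwarfs the cost of distributing it
def performExpensiveComputation(x: Int): BigInt =
  (BigInt(1) to BigInt(2000)).product * x
```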
The level of parallelism is determined from the spark.default.parallelism property, which has a default value that depends on where the Spark job is running: when running locally it is the number of cores on the machine, while for a cluster it is the total number of cores on all executor nodes in the cluster.
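The property can be set explicitly when the application is configured. A minimal sketch, assuming a Spark application run in local mode (the application name is illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Fix the default parallelism to 8 for this application, overriding the
// core-count-based default described above
val conf = new SparkConf()
  .setAppName("parallelism-demo")
  .setMaster("local[*]")
  .set("spark.default.parallelism", "8")
val sc = new SparkContext(conf)

// parallelize() with no explicit partition count now creates 8 partitions
val params = sc.parallelize(1 to 10)
println(params.partitions.length)
```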
You can also override the level of parallelism for a particular computation by passing it as the second argument to parallelize():
sc.parallelize(1 to 10, 10)
The second way to create an RDD is by creating a reference to an external dataset. We have already seen how to create an RDD of String objects for a text file:
val text: RDD[String] = sc.textFile(inputPath)
The path may be any Hadoop filesystem path, such as a file on the local filesystem or on HDFS. Internally, Spark uses TextInputFormat from the old MapReduce API to read the file (see TextInputFormat). This means that the file-splitting behavior is the same as in Hadoop itself, so in the case of HDFS there is one Spark partition per HDFS block. The default can be changed by passing a second argument to request a particular number of splits:
sc.textFile(inputPath, 10)
Another variant permits text files to be processed as whole files (similar to Processing a whole file as a record) by returning an RDD of string pairs, where the first string is the file path and the second is the file contents. Since each file is loaded into memory, this is only suitable for small files:
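The variant in question is SparkContext's wholeTextFiles() method. A minimal sketch, reusing the sc and inputPath names from the earlier examples:

```scala
import org.apache.spark.rdd.RDD

// Each element is a (file path, file contents) pair; every file is read
// into memory in its entirety, so keep the input files small
val files: RDD[(String, String)] = sc.wholeTextFiles(inputPath)
```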