there's no reason to waste storage space when Spark could instead stream through the
data once and just compute the result. 1
In practice, you will often use persist() to load a subset of your data into memory
and query it repeatedly. For example, if we knew that we wanted to compute multiple
results about the README lines that contain Python, we could write the script
shown in Example 3-4.
Example 3-4. Persisting an RDD in memory
>>> pythonLines.persist()
>>> pythonLines.count()
2
>>> pythonLines.first()
u'## Interactive Python Shell'
To summarize, every Spark program and shell session will work as follows:
1. Create some input RDDs from external data.
2. Transform them to define new RDDs using transformations like filter().
3. Ask Spark to persist() any intermediate RDDs that will need to be reused.
4. Launch actions such as count() and first() to kick off a parallel computation,
which is then optimized and executed by Spark.
cache() is the same as calling persist() with the default storage
level.
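Putting these four steps together, a minimal PySpark shell session might look like the following sketch. It simply recaps the README example used earlier in this chapter; the file name, the filter string, and the echoed results are the ones shown above, not output you should expect from an arbitrary file.
>>> lines = sc.textFile("README.md")                            # 1. create an input RDD from external data
>>> pythonLines = lines.filter(lambda line: "Python" in line)   # 2. transform it into a new RDD with filter()
>>> pythonLines.persist()                                       # 3. persist the intermediate RDD for reuse
>>> pythonLines.count()                                         # 4. actions kick off the parallel computation
2
>>> pythonLines.first()
u'## Interactive Python Shell'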
In the rest of this chapter, we'll go through each of these steps in detail, and cover
some of the most common RDD operations in Spark.
Creating RDDs
Spark provides two ways to create RDDs: loading an external dataset and parallelizing
a collection in your driver program.
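As a quick sketch, both approaches look like this in the Python shell; the file path and the contents of the example list are placeholders for illustration, not values taken from the text:
>>> lines = sc.parallelize(["pandas", "i like pandas"])   # parallelize an existing collection in the driver program
>>> lines = sc.textFile("/path/to/README.md")             # load an external dataset as an RDD of text lines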
1 The ability to always recompute an RDD is actually why RDDs are called “resilient.” When a machine holding
RDD data fails, Spark uses this ability to recompute the missing partitions, transparent to the user.