there's no reason to waste storage space when Spark could instead stream through the
data once and just compute the result. 1
In practice, you will often use persist() to load a subset of your data into memory
and query it repeatedly. For example, if we knew that we wanted to compute multiple
results about the README lines that contain Python, we could write the script
shown in Example 3-4.
Example 3-4. Persisting an RDD in memory
>>> pythonLines.persist()
>>> pythonLines.count()
2
>>> pythonLines.first()
u'## Interactive Python Shell'
To summarize, every Spark program and shell session will work as follows:
1. Create some input RDDs from external data.
2. Transform them to define new RDDs using transformations like filter().
3. Ask Spark to persist() any intermediate RDDs that will need to be reused.
4. Launch actions such as count() and first() to kick off a parallel computation,
which is then optimized and executed by Spark.
cache() is the same as calling persist() with the default storage
level.
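Putting these four steps together, a minimal PySpark shell session might look like the following sketch. It simply recaps the README example used earlier in this chapter; the file name, the filter string, and the echoed results are the ones shown above, not output you should expect from an arbitrary file.
>>> lines = sc.textFile("README.md")                            # 1. create an input RDD from external data
>>> pythonLines = lines.filter(lambda line: "Python" in line)   # 2. transform it into a new RDD with filter()
>>> pythonLines.persist()                                       # 3. persist the intermediate RDD for reuse
>>> pythonLines.count()                                         # 4. actions kick off the parallel computation
2
>>> pythonLines.first()
u'## Interactive Python Shell'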
In the rest of this chapter, we'll go through each of these steps in detail, and cover
some of the most common RDD operations in Spark.
Creating RDDs
Spark provides two ways to create RDDs: loading an external dataset and parallelizing
a collection in your driver program.
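As a quick sketch, both approaches look like this in the Python shell; the file path and the contents of the example list are placeholders for illustration, not values taken from the text:
>>> lines = sc.parallelize(["pandas", "i like pandas"])   # parallelize an existing collection in the driver program
>>> lines = sc.textFile("/path/to/README.md")             # load an external dataset as an RDD of text lines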
1 The ability to always recompute an RDD is actually why RDDs are called “resilient.” When a machine holding
RDD data fails, Spark uses this ability to recompute the missing partitions, transparent to the user.