Once created, RDDs offer two types of operations: transformations and actions.
Transformations construct a new RDD from a previous one. For example, one common
transformation is filtering data that matches a predicate. In our text file example,
we can use this to create a new RDD holding just the strings that contain the word
Python, as shown in Example 3-2.
Example 3-2. Calling the filter() transformation
>>> pythonLines = lines.filter(lambda line: "Python" in line)
Actions, on the other hand, compute a result based on an RDD, and either return it to
the driver program or save it to an external storage system (e.g., HDFS). One example
of an action we called earlier is first(), which returns the first element in an RDD
and is demonstrated in Example 3-3.
Example 3-3. Calling the first() action
>>> pythonLines.first()
u'## Interactive Python Shell'
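Actions can also write data out rather than return it to the driver. As a hedged
sketch, saveAsTextFile() writes an RDD's contents to storage; the output path here
is illustrative, and on a cluster it would typically be an HDFS URI:

>>> pythonLines.saveAsTextFile("output/python-lines")  # illustrative path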
Transformations and actions are different because of the way Spark computes RDDs.
Although you can define new RDDs any time, Spark computes them only in a lazy
fashion, that is, the first time they are used in an action. This approach might seem
unusual at first, but makes a lot of sense when you are working with Big Data. For
instance, consider Example 3-2 and Example 3-3, where we defined a text file and
then filtered the lines that include Python. If Spark were to load and store all the lines
in the file as soon as we wrote lines = sc.textFile(...), it would waste a lot of
storage space, given that we then immediately filter out many lines. Instead, once
Spark sees the whole chain of transformations, it can compute just the data needed
for its result. In fact, for the first() action, Spark scans the file only until it finds the
first matching line; it doesn't even read the whole file.
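To see this laziness end to end, here is a minimal standalone sketch of the same
chain. The input filename README.md is an assumption (any text file works), and in
the pyspark shell the SparkContext already exists as sc, so you would skip creating it:

from pyspark import SparkContext

# In a standalone script we create the context ourselves.
sc = SparkContext("local", "LazyEvaluationSketch")

# Neither of these lines touches the file yet: each call only records
# the lineage of transformations needed to compute the RDD later.
lines = sc.textFile("README.md")  # assumed input file
pythonLines = lines.filter(lambda line: "Python" in line)

# Only this action triggers computation, and for first() Spark stops
# scanning the file as soon as one matching line is found.
print(pythonLines.first())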
Finally, Spark's RDDs are by default recomputed each time you run an action on
them. If you would like to reuse an RDD in multiple actions, you can ask Spark to
persist it using RDD.persist(). We can ask Spark to persist our data in a number of
different places, which will be covered in Table 3-6. After computing it the first time,
Spark will store the RDD contents in memory (partitioned across the machines in
your cluster), and reuse them in future actions. Persisting RDDs on disk instead of
memory is also possible. The behavior of not persisting by default may again seem
unusual, but it makes a lot of sense for big datasets: if you will not reuse the RDD,
there's no reason to waste storage space when Spark could instead stream through
the data once and just compute the result.
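As a minimal sketch of reuse, continuing the shell session above, we can persist the
filtered RDD and then run two actions against it. MEMORY_ONLY is one of the
storage levels Table 3-6 enumerates:

>>> from pyspark import StorageLevel
>>> pythonLines.persist(StorageLevel.MEMORY_ONLY)
>>> pythonLines.count()   # computes the RDD and caches its partitions in memory
>>> pythonLines.first()   # reuses the cached data instead of rescanning the file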