Once created, RDDs offer two types of operations: transformations and actions.
Transformations construct a new RDD from a previous one. For example, one common
transformation is filtering data that matches a predicate. In our text file example,
we can use this to create a new RDD holding just the strings that contain the word
Python, as shown in Example 3-2.
Example 3-2. Calling the filter() transformation
>>> pythonLines = lines.filter(lambda line: "Python" in line)
Actions, on the other hand, compute a result based on an RDD, and either return it to
the driver program or save it to an external storage system (e.g., HDFS). One example
of an action we called earlier is first(), which returns the first element in an RDD
and is demonstrated in Example 3-3.
Example 3-3. Calling the first() action
>>> pythonLines.first()
u'## Interactive Python Shell'
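Actions can also write data out rather than return it to the driver. As a hedged
sketch, saveAsTextFile() writes an RDD's contents to storage; the output path here
is illustrative, and on a cluster it would typically be an HDFS URI:

>>> pythonLines.saveAsTextFile("output/python-lines")  # illustrative path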
Transformations and actions are different because of the way Spark computes RDDs.
Although you can define new RDDs any time, Spark computes them only in a lazy
fashion, that is, the first time they are used in an action. This approach might seem
unusual at first, but makes a lot of sense when you are working with Big Data. For
instance, consider Example 3-2 and Example 3-3, where we defined a text file and
then filtered the lines that include Python. If Spark were to load and store all the lines
in the file as soon as we wrote lines = sc.textFile(...), it would waste a lot of
storage space, given that we then immediately filter out many lines. Instead, once
Spark sees the whole chain of transformations, it can compute just the data needed
for its result. In fact, for the first() action, Spark scans the file only until it finds the
first matching line; it doesn't even read the whole file.
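To see this laziness end to end, here is a minimal standalone sketch of the same
chain. The input filename README.md is an assumption (any text file works), and in
the pyspark shell the SparkContext already exists as sc, so you would skip creating it:

from pyspark import SparkContext

# In a standalone script we create the context ourselves.
sc = SparkContext("local", "LazyEvaluationSketch")

# Neither of these lines touches the file yet: each call only records
# the lineage of transformations needed to compute the RDD later.
lines = sc.textFile("README.md")  # assumed input file
pythonLines = lines.filter(lambda line: "Python" in line)

# Only this action triggers computation, and for first() Spark stops
# scanning the file as soon as one matching line is found.
print(pythonLines.first())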
Finally, Spark's RDDs are by default recomputed each time you run an action on
them. If you would like to reuse an RDD in multiple actions, you can ask Spark to
persist it using RDD.persist(). We can ask Spark to persist our data in a number of
different places, which will be covered in Table 3-6. After computing it the first time,
Spark will store the RDD contents in memory (partitioned across the machines in
your cluster), and reuse them in future actions. Persisting RDDs on disk instead of
memory is also possible. The behavior of not persisting by default may again seem
unusual, but it makes a lot of sense for big datasets: if you will not reuse the RDD,
there's no reason to waste storage space when Spark could instead stream through
the data once and just compute the result.
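As a minimal sketch of reuse, continuing the shell session above, we can persist the
filtered RDD and then run two actions against it. MEMORY_ONLY is one of the
storage levels Table 3-6 enumerates:

>>> from pyspark import StorageLevel
>>> pythonLines.persist(StorageLevel.MEMORY_ONLY)
>>> pythonLines.count()   # computes the RDD and caches its partitions in memory
>>> pythonLines.first()   # reuses the cached data instead of rescanning the file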