Database Reference
In-Depth Information
Example 3-14. union() transformation in Python
errorsRDD = inputRDD . filter ( lambda x : "error" in x )
warningsRDD = inputRDD . filter ( lambda x : "warning" in x )
badLinesRDD = errorsRDD . union ( warningsRDD )
union() is a bit different than filter() , in that it operates on two RDDs instead of
one. Transformations can actually operate on any number of input RDDs.
A better way to accomplish the same result as in Example 3-14
would be to simply filter the inputRDD once, looking for either
error or warning .
Finally, as you derive new RDDs from each other using transformations, Spark keeps
track of the set of dependencies between different RDDs, called the lineage graph . It
uses this information to compute each RDD on demand and to recover lost data if
part of a persistent RDD is lost. Figure 3-1 shows a lineage graph for Example 3-14 .
Figure 3-1. RDD lineage graph created during log analysis
Actions
We've seen how to create RDDs from each other with transformations, but at some
point, we'll want to actually do something with our dataset. Actions are the second
type of RDD operation. They are the operations that return a final value to the driver
program or write data to an external storage system. Actions force the evaluation of
the transformations required for the RDD they were called on, since they need to
actually produce output.
Continuing the log example from the previous section, we might want to print out
some information about the badLinesRDD . To do that, we'll use two actions, count() ,
which returns the count as a number, and take() , which collects a number of ele‐
ments from the RDD, as shown in Examples 3-15 through 3-17 .
 
Search WWH ::




Custom Search