Database Reference
In-Depth Information
Lazy evaluation means that when we call a transformation on an RDD (for instance,
calling map() ), the operation is not immediately performed. Instead, Spark internally
records metadata to indicate that this operation has been requested. Rather than
thinking of an RDD as containing specific data, it is best to think of each RDD as
consisting of instructions on how to compute the data that we build up through
transformations. Loading data into an RDD is lazily evaluated in the same way trans‐
formations are. So, when we call sc.textFile() , the data is not loaded until it is nec‐
essary. As with transformations, the operation (in this case, reading the data) can
occur multiple times.
Although transformations are lazy, you can force Spark to execute
them at any time by running an action, such as count() . This is an
easy way to test out just part of your program.
Spark uses lazy evaluation to reduce the number of passes it has to take over our data
by grouping operations together. In systems like Hadoop MapReduce, developers
often have to spend a lot of time considering how to group together operations to
minimize the number of MapReduce passes. In Spark, there is no substantial benefit
to writing a single complex map instead of chaining together many simple opera‐
tions. Thus, users are free to organize their program into smaller, more manageable
operations.
Passing Functions to Spark
Most of Spark's transformations, and some of its actions, depend on passing in func‐
tions that are used by Spark to compute data. Each of the core languages has a slightly
different mechanism for passing functions to Spark.
Python
In Python, we have three options for passing functions into Spark. For shorter func‐
tions, we can pass in lambda expressions, as we did in Example 3-2 , and as
Example 3-18 demonstrates. Alternatively, we can pass in top-level functions, or
locally defined functions.
Example 3-18. Passing functions in Python
word = rdd . filter ( lambda s : "error" in s )
def containsError ( s ):
return "error" in s
word = rdd . filter ( containsError )
Search WWH ::




Custom Search