Function name        We are called with         We return   Function signature on RDD[T]
foreachPartition()   Iterator of the elements   Nothing     f: (Iterator[T]) → Unit
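As a brief sketch of the foreachPartition() signature above (the names here are illustrative, not from the book; it assumes an existing SparkContext named sc), the function receives each partition's iterator once, so any expensive setup can happen once per partition rather than once per element:

def sendPartition(records):
    # Stand-in for opening one connection per partition
    # (e.g., a database or socket handle); here we just buffer locally.
    buffered = list(records)
    # Stand-in for writing the whole batch and closing the connection.
    print("wrote %d records from this partition" % len(buffered))

sc.parallelize(range(100)).foreachPartition(sendPartition)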
In addition to avoiding setup work, we can sometimes use mapPartitions() to avoid
object creation overhead. Sometimes we need to make an object for aggregating the
result that is of a different type. Thinking back to Chapter 3, where we computed the
average, one of the ways we did this was by converting our RDD of numbers to an
RDD of tuples so we could track the number of elements processed in our reduce
step. Instead of doing this for each element, we can instead create the tuple once per
partition, as shown in Examples 6-13 and 6-14.
Example 6-13. Average without mapPartitions() in Python

def combineCtrs(c1, c2):
    return (c1[0] + c2[0], c1[1] + c2[1])

def basicAvg(nums):
    """Compute the average"""
    sumCount = nums.map(lambda num: (num, 1)).reduce(combineCtrs)
    return sumCount[0] / float(sumCount[1])
Example 6-14. Average with mapPartitions() in Python

def partitionCtr(nums):
    """Compute sumCounter for partition"""
    sumCount = [0, 0]
    for num in nums:
        sumCount[0] += num
        sumCount[1] += 1
    return [sumCount]

def fastAvg(nums):
    """Compute the avg"""
    sumCount = nums.mapPartitions(partitionCtr).reduce(combineCtrs)
    return sumCount[0] / float(sumCount[1])
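As a quick check, here is a hedged usage sketch (assuming an existing SparkContext named sc in local mode, and the basicAvg() above written so that it returns the ratio):

nums = sc.parallelize([1, 2, 3, 4], 2)
print(basicAvg(nums))  # 2.5
print(fastAvg(nums))   # 2.5

Both paths reduce to the same (sum, count) pair; fastAvg() simply builds one [sum, count] list per partition instead of one tuple per element.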
Piping to External Programs
With three language bindings to choose from out of the box, you may have all the
options you need for writing Spark applications. However, if none of Scala, Java, or
Python does what you need, then Spark provides a general mechanism to pipe data to
programs in other languages, like R scripts.
Spark provides a pipe() method on RDDs. Spark's pipe() lets us write parts of jobs
using any language we want, as long as it can read and write to Unix standard
streams. With pipe(), you can write a transformation of an RDD that reads each
RDD element from standard input as a String, manipulates that String however you
like, and then writes the result(s) as Strings to standard output.
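A minimal sketch of pipe() (again assuming an existing SparkContext named sc, and that the tr utility is available on every worker; this particular command is our own illustration, not the book's example):

# Upper-case every element by piping it through an external Unix program.
words = sc.parallelize(["spark", "pipes", "text"])
shouted = words.pipe("tr '[:lower:]' '[:upper:]'")
print(shouted.collect())  # ['SPARK', 'PIPES', 'TEXT']

Each element is written to the external program's standard input as a line of text, and each line the program writes to standard output becomes an element of the resulting RDD.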