Function name        We are called with         We return   Function signature on RDD[T]
foreachPartition()   Iterator of the elements   Nothing     f: (Iterator[T]) → Unit
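As a brief sketch of the foreachPartition() signature above (the names here are illustrative, not from the book; it assumes an existing SparkContext named sc), the function receives each partition's iterator once, so any expensive setup can happen once per partition rather than once per element:

def sendPartition(records):
    # Stand-in for opening one connection per partition
    # (e.g., a database or socket handle); here we just buffer locally.
    buffered = list(records)
    # Stand-in for writing the whole batch and closing the connection.
    print("wrote %d records from this partition" % len(buffered))

sc.parallelize(range(100)).foreachPartition(sendPartition)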
In addition to avoiding setup work, we can sometimes use mapPartitions() to avoid
object creation overhead. Sometimes we need to make an object for aggregating the
result that is of a different type. Thinking back to Chapter 3, where we computed the
average, one of the ways we did this was by converting our RDD of numbers to an
RDD of tuples so we could track the number of elements processed in our reduce
step. Instead of doing this for each element, we can instead create the tuple once per
partition, as shown in Examples 6-13 and 6-14.
Example 6-13. Average without mapPartitions() in Python

def combineCtrs(c1, c2):
    return (c1[0] + c2[0], c1[1] + c2[1])

def basicAvg(nums):
    """Compute the average"""
    sumCount = nums.map(lambda num: (num, 1)).reduce(combineCtrs)
    return sumCount[0] / float(sumCount[1])
Example 6-14. Average with mapPartitions() in Python

def partitionCtr(nums):
    """Compute sumCounter for partition"""
    sumCount = [0, 0]
    for num in nums:
        sumCount[0] += num
        sumCount[1] += 1
    return [sumCount]

def fastAvg(nums):
    """Compute the avg"""
    sumCount = nums.mapPartitions(partitionCtr).reduce(combineCtrs)
    return sumCount[0] / float(sumCount[1])
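As a quick check, here is a hedged usage sketch (assuming an existing SparkContext named sc in local mode, and the basicAvg() above written so that it returns the ratio):

nums = sc.parallelize([1, 2, 3, 4], 2)
print(basicAvg(nums))  # 2.5
print(fastAvg(nums))   # 2.5

Both paths reduce to the same (sum, count) pair; fastAvg() simply builds one [sum, count] list per partition instead of one tuple per element.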
Piping to External Programs
With three language bindings to choose from out of the box, you may have all the
options you need for writing Spark applications. However, if none of Scala, Java, or
Python does what you need, then Spark provides a general mechanism to pipe data to
programs in other languages, like R scripts.
Spark provides a pipe() method on RDDs. Spark's pipe() lets us write parts of jobs
using any language we want, as long as it can read and write to Unix standard
streams. With pipe(), you can write a transformation of an RDD that reads each
RDD element from standard input as a String, manipulates that String however you
like, and then writes the result(s) as Strings to standard output.
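A minimal sketch of pipe() (again assuming an existing SparkContext named sc, and that the tr utility is available on every worker; this particular command is our own illustration, not the book's example):

# Upper-case every element by piping it through an external Unix program.
words = sc.parallelize(["spark", "pipes", "text"])
shouted = words.pipe("tr '[:lower:]' '[:upper:]'")
print(shouted.collect())  # ['SPARK', 'PIPES', 'TEXT']

Each element is written to the external program's standard input as a line of text, and each line the program writes to standard output becomes an element of the resulting RDD.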