Example 4-15. reduceByKey() with custom parallelism in Python
data = [("a", 3), ("b", 4), ("a", 1)]
sc.parallelize(data).reduceByKey(lambda x, y: x + y)      # Default parallelism
sc.parallelize(data).reduceByKey(lambda x, y: x + y, 10)  # Custom parallelism
Example 4-16. reduceByKey() with custom parallelism in Scala
val data = Seq(("a", 3), ("b", 4), ("a", 1))
sc.parallelize(data).reduceByKey((x, y) => x + y)      // Default parallelism
sc.parallelize(data).reduceByKey((x, y) => x + y, 10)  // Custom parallelism
Sometimes, we want to change the partitioning of an RDD outside the context of grouping and aggregation operations. For those cases, Spark provides the repartition() function, which shuffles the data across the network to create a new set of partitions. Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of repartition() called coalesce(), which avoids data movement, but only if you are decreasing the number of RDD partitions. To know whether you can safely call coalesce(), check the size of the RDD using rdd.partitions.size() in Java/Scala or rdd.getNumPartitions() in Python, and make sure that you are coalescing it to fewer partitions than it currently has.
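As a minimal sketch of that check in Python, assuming the same SparkContext sc used in the examples above:

rdd = sc.parallelize(range(1000), 8)  # explicitly create 8 partitions
print(rdd.getNumPartitions())         # 8
smaller = rdd.coalesce(4)             # safe: 4 < 8, so no shuffle is required
print(smaller.getNumPartitions())     # 4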
Grouping Data
With keyed data, a common use case is grouping our data by key; for example, viewing all of a customer's orders together.
If our data is already keyed in the way we want, groupByKey() will group our data
using the key in our RDD. On an RDD consisting of keys of type K and values of type
V , we get back an RDD of type [K, Iterable[V]] .
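As a quick sketch in Python (the sample pairs here are hypothetical):

pairs = sc.parallelize([("a", 3), ("b", 4), ("a", 1)])
grouped = pairs.groupByKey()  # RDD of (key, iterable of values)
print(grouped.mapValues(list).collect())
# e.g. [('a', [3, 1]), ('b', [4])]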
groupBy() works on unpaired data or data where we want to use a different condi‐
tion besides equality on the current key. It takes a function that it applies to every
element in the source RDD and uses the result to determine the key.
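For instance, a hypothetical sketch that groups plain numbers by parity:

nums = sc.parallelize([1, 2, 3, 4, 5])
byParity = nums.groupBy(lambda x: x % 2)  # the function's result becomes the key
print(byParity.mapValues(list).collect())
# e.g. [(0, [2, 4]), (1, [1, 3, 5])]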
If you find yourself writing code where you groupByKey() and then use a reduce() or fold() on the values, you can probably achieve the same result more efficiently by using one of the per-key aggregation functions. Rather than reducing the RDD to an in-memory value, we reduce the data per key and get back an RDD with the reduced values corresponding to each key. For example, rdd.reduceByKey(func) produces the same RDD as rdd.groupByKey().mapValues(value => value.reduce(func)) but is more efficient as it avoids the step of creating a list of values for each key.
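To make that equivalence concrete, here is a hedged Python sketch; both approaches below produce the same per-key sums, but the groupByKey() version first builds a list of values for each key:

from functools import reduce

pairs = sc.parallelize([("a", 3), ("b", 4), ("a", 1)])
sums = pairs.reduceByKey(lambda x, y: x + y)  # combines values during the shuffle
viaGroup = pairs.groupByKey().mapValues(lambda vs: reduce(lambda x, y: x + y, vs))
print(sorted(sums.collect()))      # [('a', 4), ('b', 4)]
print(sorted(viaGroup.collect()))  # [('a', 4), ('b', 4)]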