Example 4-15. reduceByKey() with custom parallelism in Python
data = [("a", 3), ("b", 4), ("a", 1)]
sc.parallelize(data).reduceByKey(lambda x, y: x + y)      # Default parallelism
sc.parallelize(data).reduceByKey(lambda x, y: x + y, 10)  # Custom parallelism
Example 4-16. reduceByKey() with custom parallelism in Scala
val data = Seq(("a", 3), ("b", 4), ("a", 1))
sc.parallelize(data).reduceByKey((x, y) => x + y)      // Default parallelism
sc.parallelize(data).reduceByKey((x, y) => x + y, 10)  // Custom parallelism
Sometimes, we want to change the partitioning of an RDD outside the context of grouping and aggregation operations. For those cases, Spark provides the repartition() function, which shuffles the data across the network to create a new set of partitions. Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of repartition() called coalesce(), which avoids data movement, but only if you are decreasing the number of RDD partitions. To know whether you can safely call coalesce(), check the size of the RDD using rdd.partitions.size() in Java/Scala or rdd.getNumPartitions() in Python, and make sure that you are coalescing it to fewer partitions than it currently has.
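As a minimal sketch of that check in Python, assuming the same SparkContext sc used in the examples above:

rdd = sc.parallelize(range(1000), 8)  # explicitly create 8 partitions
print(rdd.getNumPartitions())         # 8
smaller = rdd.coalesce(4)             # safe: 4 < 8, so no shuffle is required
print(smaller.getNumPartitions())     # 4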
Grouping Data
With keyed data, a common use case is grouping our data by key; for example, viewing all of a customer's orders together.
If our data is already keyed in the way we want, groupByKey() will group our data
using the key in our RDD. On an RDD consisting of keys of type K and values of type
V , we get back an RDD of type [K, Iterable[V]] .
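As a quick sketch in Python (the sample pairs here are hypothetical):

pairs = sc.parallelize([("a", 3), ("b", 4), ("a", 1)])
grouped = pairs.groupByKey()  # RDD of (key, iterable of values)
print(grouped.mapValues(list).collect())
# e.g. [('a', [3, 1]), ('b', [4])]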
groupBy() works on unpaired data or data where we want to use a different condi‐
tion besides equality on the current key. It takes a function that it applies to every
element in the source RDD and uses the result to determine the key.
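For instance, a hypothetical sketch that groups plain numbers by parity:

nums = sc.parallelize([1, 2, 3, 4, 5])
byParity = nums.groupBy(lambda x: x % 2)  # the function's result becomes the key
print(byParity.mapValues(list).collect())
# e.g. [(0, [2, 4]), (1, [1, 3, 5])]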
If you find yourself writing code where you groupByKey() and then use a reduce() or fold() on the values, you can probably achieve the same result more efficiently by using one of the per-key aggregation functions. Rather than reducing the RDD to an in-memory value, we reduce the data per key and get back an RDD with the reduced values corresponding to each key. For example, rdd.reduceByKey(func) produces the same RDD as rdd.groupByKey().mapValues(value => value.reduce(func)) but is more efficient as it avoids the step of creating a list of values for each key.
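To make that equivalence concrete, here is a hedged Python sketch; both approaches below produce the same per-key sums, but the groupByKey() version first builds a list of values for each key:

from functools import reduce

pairs = sc.parallelize([("a", 3), ("b", 4), ("a", 1)])
sums = pairs.reduceByKey(lambda x, y: x + y)  # combines values during the shuffle
viaGroup = pairs.groupByKey().mapValues(lambda vs: reduce(lambda x, y: x + y, vs))
print(sorted(sums.collect()))      # [('a', 4), ('b', 4)]
print(sorted(viaGroup.collect()))  # [('a', 4), ('b', 4)]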