Working with Key/Value Pairs - Learning Spark

Example 4-24. Determining partitioner of an RDD

scala> val pairs = sc.parallelize(List((1, 1), (2, 2), (3, 3)))
pairs: spark.RDD[(Int, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:12

scala> pairs.partitioner
res0: Option[spark.Partitioner] = None

scala> val partitioned = pairs.partitionBy(new spark.HashPartitioner(2))
partitioned: spark.RDD[(Int, Int)] = ShuffledRDD[1] at partitionBy at <console>:14

scala> partitioned.partitioner
res1: Option[spark.Partitioner] = Some(spark.HashPartitioner@5147788d)

In this short session, we created an RDD of (Int, Int) pairs, which initially has no partitioning information (an Option with value None). We then created a second RDD by hash-partitioning the first. If we actually wanted to use partitioned in further operations, then we should have appended persist() to the third line of input, in which partitioned is defined. This is for the same reason that we needed persist() for userData in the previous example: without persist(), subsequent RDD actions will evaluate the entire lineage of partitioned, which will cause pairs to be hash-partitioned over and over.
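The persist() advice can be sketched as a small standalone program. This is a minimal sketch, assuming a current Spark release where the classes live under org.apache.spark (the session above uses the older spark package prefix). The object name, the helper partitionAndPersist, the application name, and the local[2] master are illustrative assumptions, not part of the original example.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PersistAfterPartitionBy {
  // Hash-partition the pairs and mark the result for caching, so that later
  // actions reuse the partitioned data instead of repeating the shuffle.
  def partitionAndPersist(pairs: RDD[(Int, Int)], numPartitions: Int): RDD[(Int, Int)] =
    pairs.partitionBy(new HashPartitioner(numPartitions)).persist()

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions made for this sketch.
    val conf = new SparkConf().setAppName("PersistAfterPartitionBy").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(List((1, 1), (2, 2), (3, 3)))
      val partitioned = partitionAndPersist(pairs, 2)

      // The partitioner is now defined (a Some wrapping a HashPartitioner).
      println(partitioned.partitioner.isDefined)

      // The first action computes and caches the partitions;
      // the second action reads the cached partitions.
      println(partitioned.count())
      println(partitioned.keys.collect().sorted.mkString(","))
    } finally {
      sc.stop()
    }
  }
}
```

Calling persist() directly on the result of partitionBy() matters because the cached RDD is the one that carries the partitioner; persisting pairs alone would still leave every action to redo the shuffle.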

Operations That Benefit from Partitioning

Many of Spark's operations involve shuffling data by key across the network. All of these will benefit from partitioning. As of Spark 1.0, the operations that benefit from partitioning are cogroup(), groupWith(), join(), leftOuterJoin(), rightOuterJoin(), groupByKey(), reduceByKey(), combineByKey(), and lookup().
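To illustrate one operation from this list, the sketch below runs reduceByKey() on an RDD that is already hash-partitioned and then inspects the result. It assumes a current Spark release with classes under org.apache.spark. The object name, the helpers sumByKey and shufflesDirectly, and the sample data are hypothetical, and checking dependencies for a ShuffleDependency is only one informal way to see whether a shuffle was added.

```scala
import org.apache.spark.{HashPartitioner, ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ReduceOnPartitioned {
  // Sum the values for each key.
  def sumByKey(rdd: RDD[(String, Int)]): RDD[(String, Int)] =
    rdd.reduceByKey(_ + _)

  // True if any direct dependency of the RDD is a shuffle.
  def shufflesDirectly(rdd: RDD[_]): Boolean =
    rdd.dependencies.exists(_.isInstanceOf[ShuffleDependency[_, _, _]])

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions made for this sketch.
    val conf = new SparkConf().setAppName("ReduceOnPartitioned").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Partition once by key and cache the result.
      val events = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3), ("b", 4)))
        .partitionBy(new HashPartitioner(2))
        .persist()

      val totals = sumByKey(events)

      // The reduced RDD keeps the parent's partitioner.
      println(totals.partitioner == events.partitioner)
      // Whether reduceByKey itself introduced a shuffle step.
      println(shufflesDirectly(totals))
      println(totals.collect().sorted.mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

For comparison, calling sumByKey on the unpartitioned input (before partitionBy()) would be expected to report a shuffle dependency, since Spark then has to bring equal keys together itself.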

For operations that act on a single RDD, such as reduceByKey(), running on a pre-partitioned RDD will cause all the values for each key to be computed locally on a single machine, requiring only the final, locally reduced value to be sent from each worker node back to the master. For binary operations, such as cogroup() and join(), pre-partitioning will cause at least one of the RDDs (the one with the known partitioner) to not be shuffled. If both RDDs have the same partitioner, and if they are cached on the same machines (e.g., one was created using mapValues() on the other, which preserves keys and partitioning) or if one of them has not yet been computed, then no shuffling across the network will occur.
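The mapValues() case mentioned above can be sketched as follows: one RDD is partitioned and cached, a second is derived from it with mapValues(), and the two are joined. This is a minimal sketch assuming a current Spark release with classes under org.apache.spark. The object name, the helper joinWithDerived, and the sample data are hypothetical.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CoPartitionedJoin {
  // Returns the joined RDD and whether the derived RDD kept the base partitioner.
  def joinWithDerived(sc: SparkContext): (RDD[(Int, (Int, Int))], Boolean) = {
    val base = sc.parallelize(List((1, 10), (2, 20), (3, 30)))
      .partitionBy(new HashPartitioner(2))
      .persist()

    // mapValues() cannot change keys, so the partitioner carries over.
    val doubled = base.mapValues(_ * 2)
    val samePartitioner = doubled.partitioner == base.partitioner

    // Both inputs share one partitioner, so matching keys already sit in
    // matching partitions when the join runs.
    (base.join(doubled), samePartitioner)
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions made for this sketch.
    val conf = new SparkConf().setAppName("CoPartitionedJoin").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (joined, samePartitioner) = joinWithDerived(sc)
      println(samePartitioner)
      println(joined.collect().sorted.mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

Had doubled been built with map() instead of mapValues(), Spark could not assume the keys were unchanged, so the derived RDD would carry no partitioner and the join would need to shuffle that side.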

Operations That Affect Partitioning

Spark knows internally how each of its operations affects partitioning, and automatically sets the partitioner on RDDs created by operations that partition the data. For example, suppose you called join() to join two RDDs; because the elements with the same key have been hashed to the same machine, Spark knows that the result is
