Database Reference
In-Depth Information
Failure to persist an RDD after it has been transformed with parti
tionBy() will cause subsequent uses of the RDD to repeat the par‐
titioning of the data. Without persistence, use of the partitioned
RDD will cause reevaluation of the RDDs complete lineage. That
would negate the advantage of partitionBy() , resulting in
repeated partitioning and shuffling of data across the network,
similar to what occurs without any specified partitioner.
In fact, many other Spark operations automatically result in an RDD with known
partitioning information, and many operations other than join() will take advantage
of this information. For example, sortByKey() and groupByKey() will result in
range-partitioned and hash-partitioned RDDs, respectively. On the other hand, oper‐
ations like map() cause the new RDD to forget the parent's partitioning information,
because such operations could theoretically modify the key of each record. The next
few sections describe how to determine how an RDD is partitioned, and exactly how
partitioning affects the various Spark operations.
Partitioning in Java and Python
Spark's Java and Python APIs benefit from partitioning in the same
way as the Scala API. However, in Python, you cannot pass a Hash
Partitioner object to partitionBy ; instead, you just pass the
number of partitions desired (e.g., rdd.partitionBy(100) ).
Determining an RDD's Partitioner
In Scala and Java, you can determine how an RDD is partitioned using its parti
tioner property (or partitioner() method in Java). 2 This returns a scala.Option
object, which is a Scala class for a container that may or may not contain one item.
You can call isDefined() on the Option to check whether it has a value, and get() to
get this value. If present, the value will be a spark.Partitioner object. This is essen‐
tially a function telling the RDD which partition each key goes into; we'll talk more
about this later.
The partitioner property is a great way to test in the Spark shell how different
Spark operations affect partitioning, and to check that the operations you want to do
in your program will yield the right result (see Example 4-24 ).
2 The Python API does not yet offer a way to query partitioners, though it still uses them internally.
 
Search WWH ::




Custom Search