Failure to persist an RDD after it has been transformed with partitionBy()
will cause subsequent uses of the RDD to repeat the partitioning of the data.
Without persistence, use of the partitioned RDD will cause reevaluation of the
RDD's complete lineage. That would negate the advantage of partitionBy(),
resulting in repeated partitioning and shuffling of data across the network,
similar to what occurs without any specified partitioner.
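The cost described in this warning can be illustrated with a small model in plain Python. This is only a sketch, not Spark code: the LazyPartitioned class, its shuffle_count field, and its partitions() method are hypothetical names invented for the illustration, and the "shuffle" is just an in-memory regrouping by hash(key).

```python
class LazyPartitioned:
    """Toy stand-in for a lazily evaluated, hash-partitioned dataset (not Spark)."""

    def __init__(self, pairs, num_partitions):
        self.pairs = pairs
        self.num_partitions = num_partitions
        self.shuffle_count = 0   # how many times the data has been repartitioned
        self._persist = False    # set by persist()
        self._cached = None      # filled on first use, only if persisted

    def persist(self):
        # Mark the partitioned result to be kept after it is first computed.
        self._persist = True
        return self

    def _shuffle(self):
        # Regroup every record by the hash of its key; this models the
        # expensive step that would move data across the network.
        self.shuffle_count += 1
        buckets = [[] for _ in range(self.num_partitions)]
        for key, value in self.pairs:
            buckets[hash(key) % self.num_partitions].append((key, value))
        return buckets

    def partitions(self):
        # Each use re-runs the shuffle unless a persisted copy exists.
        if self._cached is not None:
            return self._cached
        buckets = self._shuffle()
        if self._persist:
            self._cached = buckets
        return buckets


pairs = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]

unpersisted = LazyPartitioned(pairs, 2)
persisted = LazyPartitioned(pairs, 2).persist()

for _ in range(3):               # three separate "uses" of each dataset
    unpersisted.partitions()
    persisted.partitions()

print(unpersisted.shuffle_count)  # 3
print(persisted.shuffle_count)    # 1
```

In this model the unpersisted dataset pays for the regrouping on every use, while the persisted one pays once and reuses the result, which is the behavior the warning describes for partitionBy() followed by persist().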
In fact, many other Spark operations automatically result in an RDD with known
partitioning information, and many operations other than join() will take
advantage of this information. For example, sortByKey() and groupByKey() will
result in range-partitioned and hash-partitioned RDDs, respectively. On the
other hand, operations like map() cause the new RDD to forget the parent's
partitioning information, because such operations could theoretically modify
the key of each record. The next few sections describe how to determine how an
RDD is partitioned, and exactly how partitioning affects the various Spark
operations.
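To see why an operation that may rewrite keys cannot safely keep the parent's partitioning, consider the following plain-Python sketch. It is a toy model rather than Spark code; hash_partition, partition_of, and is_hash_partitioned are hypothetical helpers introduced only for this illustration, and integer keys are used so that hash(key) is predictable.

```python
def partition_of(key, num_partitions):
    # Hash partitioning: the key's hash, modulo the number of partitions.
    return hash(key) % num_partitions


def hash_partition(pairs, num_partitions):
    # Place each (key, value) record in the bucket chosen by its key.
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[partition_of(key, num_partitions)].append((key, value))
    return buckets


def is_hash_partitioned(buckets):
    # True only if every record sits in the bucket its key hashes to.
    n = len(buckets)
    return all(
        partition_of(key, n) == index
        for index, bucket in enumerate(buckets)
        for key, _ in bucket
    )


buckets = hash_partition([(1, 10), (2, 20), (3, 30), (4, 40)], 2)

# A transformation that touches only the values leaves every key in place.
values_doubled = [[(k, v * 2) for k, v in bucket] for bucket in buckets]

# A general per-record transformation is free to change the key.
keys_shifted = [[(k + 1, v) for k, v in bucket] for bucket in buckets]

print(is_hash_partitioned(buckets))         # True
print(is_hash_partitioned(values_doubled))  # True
print(is_hash_partitioned(keys_shifted))    # False
```

Once keys may change, records can end up in buckets that no longer match their hash, so the only safe assumption for the result of an arbitrary map() is that the layout is unknown.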
Partitioning in Java and Python
Spark's Java and Python APIs benefit from partitioning in the same way as the
Scala API. However, in Python, you cannot pass a HashPartitioner object to
partitionBy(); instead, you just pass the number of partitions desired (e.g.,
rdd.partitionBy(100)).
Determining an RDD's Partitioner
In Scala and Java, you can determine how an RDD is partitioned using its
partitioner property (or partitioner() method in Java).[2] This returns a
scala.Option object, which is a Scala class for a container that may or may
not contain one item. You can call isDefined() on the Option to check whether
it has a value, and get() to retrieve this value. If present, the value will be
an org.apache.spark.Partitioner object. This is essentially a function telling
the RDD which partition each key goes into; we'll talk more about this later.
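The two ideas in this paragraph, a partitioner that may or may not be present and a partitioner acting as a function from key to partition number, can be modeled in a few lines of plain Python. This is a sketch under stated assumptions, not Spark's API: ToyRDD, its partition_by() method, and describe() are hypothetical names, None plays the role of an empty Option, and the HashPartitioner class here is a stand-in defined in the example itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HashPartitioner:
    """Toy partitioner: maps a key to a partition index in [0, num_partitions)."""
    num_partitions: int

    def get_partition(self, key) -> int:
        return hash(key) % self.num_partitions


@dataclass
class ToyRDD:
    """Toy dataset whose partitioner is either known or absent (None)."""
    pairs: list
    partitioner: Optional[HashPartitioner] = None

    def partition_by(self, partitioner: HashPartitioner) -> "ToyRDD":
        # Return a new dataset that records how it is partitioned.
        return ToyRDD(self.pairs, partitioner)


def describe(rdd: ToyRDD) -> str:
    # Checking for None mirrors calling isDefined() on the Option.
    if rdd.partitioner is None:
        return "no known partitioner"
    # Reading the attribute mirrors calling get() once a value is known to exist.
    p = rdd.partitioner
    return f"hash partitioner with {p.num_partitions} partitions"


raw = ToyRDD([(1, "a"), (7, "b")])
parted = raw.partition_by(HashPartitioner(4))

print(describe(raw))                        # no known partitioner
print(describe(parted))                     # hash partitioner with 4 partitions
print(parted.partitioner.get_partition(7))  # 3
```

Checking for absence before reading the value is the same discipline the text describes for Option: call isDefined() first, and only then get().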
The partitioner property is a great way to test in the Spark shell how
different Spark operations affect partitioning, and to check that the
operations you want to do in your program will yield the right result (see
Example 4-24).
[2] The Python API does not yet offer a way to query partitioners, though it still uses them internally.