Working with Key/Value Pairs - Learning Spark

Database Reference

In-Depth Information

Failure to persist an RDD after it has been transformed with parti

tionBy() will cause subsequent uses of the RDD to repeat the par‐

titioning of the data. Without persistence, use of the partitioned

RDD will cause reevaluation of the RDDs complete lineage. That

would negate the advantage of partitionBy() , resulting in

repeated partitioning and shuffling of data across the network,

similar to what occurs without any specified partitioner.

In fact, many other Spark operations automatically result in an RDD with known

partitioning information, and many operations other than join() will take advantage

of this information. For example, sortByKey() and groupByKey() will result in

range-partitioned and hash-partitioned RDDs, respectively. On the other hand, oper‐

ations like map() cause the new RDD to forget the parent's partitioning information,

because such operations could theoretically modify the key of each record. The next

few sections describe how to determine how an RDD is partitioned, and exactly how

partitioning affects the various Spark operations.

Partitioning in Java and Python

Spark's Java and Python APIs benefit from partitioning in the same

way as the Scala API. However, in Python, you cannot pass a Hash

Partitioner object to partitionBy ; instead, you just pass the

number of partitions desired (e.g., rdd.partitionBy(100) ).

Determining an RDD's Partitioner

In Scala and Java, you can determine how an RDD is partitioned using its parti

tioner property (or partitioner() method in Java). 2 This returns a scala.Option

object, which is a Scala class for a container that may or may not contain one item.

You can call isDefined() on the Option to check whether it has a value, and get() to

get this value. If present, the value will be a spark.Partitioner object. This is essen‐

tially a function telling the RDD which partition each key goes into; we'll talk more

about this later.

The partitioner property is a great way to test in the Spark shell how different

Spark operations affect partitioning, and to check that the operations you want to do

in your program will yield the right result (see Example 4-24 ).

2 The Python API does not yet offer a way to query partitioners, though it still uses them internally.

Search WWH ::

Custom Search

Home