hash-partitioned, and operations like reduceByKey() on the join result are going to
be significantly faster.
The flipside, however, is that for transformations that cannot be guaranteed to produce a known partitioning, the output RDD will not have a partitioner set. For example, if you call map() on a hash-partitioned RDD of key/value pairs, the function passed to map() can in theory change the key of each element, so the result will not have a partitioner. Spark does not analyze your functions to check whether they retain the key. Instead, it provides two other operations, mapValues() and flatMapValues(), which guarantee that each tuple's key remains the same.
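As a quick illustration, here is a minimal sketch (the object and variable names are ours) that builds a hash-partitioned pair RDD and inspects its partitioner field after map() and mapValues():

```scala
import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}

object PartitionerDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PartitionerDemo").setMaster("local"))

    // Hash-partition a small pair RDD across four partitions.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
                  .partitionBy(new HashPartitioner(4))

    println(pairs.partitioner)                                   // Some(HashPartitioner)
    // map() could change the keys, so Spark drops the partitioner:
    println(pairs.map { case (k, v) => (k, v + 1) }.partitioner) // None
    // mapValues() cannot touch the keys, so the partitioner survives:
    println(pairs.mapValues(v => v + 1).partitioner)             // Some(HashPartitioner)

    sc.stop()
  }
}
```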
All that said, here are all the operations that result in a partitioner being set on the output RDD: cogroup(), groupWith(), join(), leftOuterJoin(), rightOuterJoin(), groupByKey(), reduceByKey(), combineByKey(), partitionBy(), sort(), mapValues() (if the parent RDD has a partitioner), flatMapValues() (if parent has a partitioner), and filter() (if parent has a partitioner). All other operations will produce a result with no partitioner.
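Note that the first group of these operations sets a partitioner even when the parent has none, because they shuffle data by key. A short sketch (reusing the sc from the example above):

```scala
// reduceByKey() shuffles by key, so its output is hash-partitioned
// even though the parent RDD had no partitioner set.
val counts = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
               .reduceByKey(_ + _)
println(counts.partitioner)   // Some(HashPartitioner)
```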
Finally, for binary operations, which partitioner is set on the output depends on the parent RDDs' partitioners. By default, it is a hash partitioner, with the number of partitions set to the level of parallelism of the operation. However, if one of the parents has a partitioner set, it will be that partitioner; and if both parents have a partitioner set, it will be the partitioner of the first parent.
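The following sketch (illustrative names, again reusing sc) checks this rule for join() when only one parent has a partitioner:

```scala
import org.apache.spark.HashPartitioner

// `left` has an explicit 100-partition hash partitioner; `right` has none.
val left  = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(new HashPartitioner(100))
val right = sc.parallelize(Seq((1, "x"), (3, "y")))

// join() reuses left's partitioner, so only right needs to be shuffled:
println(left.join(right).partitioner)   // Some(HashPartitioner), 100 partitions
```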
Example: PageRank
As an example of a more involved algorithm that can benefit from RDD partitioning,
we consider PageRank. The PageRank algorithm, named after Google's Larry Page,
aims to assign a measure of importance (a “rank”) to each document in a set based on
how many documents have links to it. It can be used to rank web pages, of course,
but also scientific articles, or influential users in a social network.
PageRank is an iterative algorithm that performs many joins, so it is a good use case for RDD partitioning. The algorithm maintains two datasets: one of (pageID, linkList) elements containing the list of neighbors of each page, and one of (pageID, rank) elements containing the current rank for each page. It proceeds as follows:
1. Initialize each page's rank to 1.0.
2. On each iteration, have page p send a contribution of rank(p) / numNeighbors(p)
to its neighbors (the pages it has links to).
3. Set each page's rank to 0.15 + 0.85 * contributionsReceived.
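Translated into Spark, these steps might look like the following Scala sketch. It assumes the neighbor lists were previously saved as a Spark objectFile of (String, Seq[String]) pairs at a hypothetical path "links"; note how partitionBy() on the link list and mapValues() for the ranks keep the two datasets co-partitioned across the repeated join():

```scala
import org.apache.spark.HashPartitioner

// Hash-partition the (pageID, linkList) dataset once and cache it,
// since it is reused on every iteration.
val links = sc.objectFile[(String, Seq[String])]("links")   // "links" is a hypothetical path
              .partitionBy(new HashPartitioner(100))
              .persist()

// Step 1: initialize each page's rank to 1.0. Using mapValues() means
// ranks keeps the same partitioner as links.
var ranks = links.mapValues(_ => 1.0)

// Run 10 iterations of PageRank.
for (i <- 0 until 10) {
  // Step 2: each page sends rank(p) / numNeighbors(p) to its neighbors.
  val contributions = links.join(ranks).flatMap {
    case (pageId, (pageLinks, rank)) =>
      pageLinks.map(dest => (dest, rank / pageLinks.size))
  }
  // Step 3: new rank = 0.15 + 0.85 * sum of received contributions.
  ranks = contributions.reduceByKey(_ + _).mapValues(v => 0.15 + 0.85 * v)
}

ranks.saveAsTextFile("ranks")   // hypothetical output path
```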