hash-partitioned, and operations like reduceByKey() on the join result are going to
be significantly faster.
The flipside, however, is that for transformations that cannot be guaranteed to produce a known partitioning, the output RDD will not have a partitioner set. For example, if you call map() on a hash-partitioned RDD of key/value pairs, the function passed to map() can in theory change the key of each element, so the result will not have a partitioner. Spark does not analyze your functions to check whether they retain the key. Instead, it provides two other operations, mapValues() and flatMapValues(), which guarantee that each tuple's key remains the same.
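As a quick illustration, here is a minimal sketch (the object and variable names are ours) that builds a hash-partitioned pair RDD and inspects its partitioner field after map() and mapValues():

```scala
import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}

object PartitionerDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PartitionerDemo").setMaster("local"))

    // Hash-partition a small pair RDD across four partitions.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
                  .partitionBy(new HashPartitioner(4))

    println(pairs.partitioner)                                   // Some(HashPartitioner)
    // map() could change the keys, so Spark drops the partitioner:
    println(pairs.map { case (k, v) => (k, v + 1) }.partitioner) // None
    // mapValues() cannot touch the keys, so the partitioner survives:
    println(pairs.mapValues(v => v + 1).partitioner)             // Some(HashPartitioner)

    sc.stop()
  }
}
```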
All that said, here are all the operations that result in a partitioner being set on the output RDD: cogroup(), groupWith(), join(), leftOuterJoin(), rightOuterJoin(), groupByKey(), reduceByKey(), combineByKey(), partitionBy(), sort(), mapValues() (if the parent RDD has a partitioner), flatMapValues() (if parent has a partitioner), and filter() (if parent has a partitioner). All other operations will produce a result with no partitioner.
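Note that the first group of these operations sets a partitioner even when the parent has none, because they shuffle data by key. A short sketch (reusing the sc from the example above):

```scala
// reduceByKey() shuffles by key, so its output is hash-partitioned
// even though the parent RDD had no partitioner set.
val counts = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
               .reduceByKey(_ + _)
println(counts.partitioner)   // Some(HashPartitioner)
```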
Finally, for binary operations, which partitioner is set on the output depends on the parent RDDs' partitioners. By default, it is a hash partitioner, with the number of partitions set to the level of parallelism of the operation. However, if one of the parents has a partitioner set, it will be that partitioner; and if both parents have a partitioner set, it will be the partitioner of the first parent.
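The following sketch (illustrative names, again reusing sc) checks this rule for join() when only one parent has a partitioner:

```scala
import org.apache.spark.HashPartitioner

// `left` has an explicit 100-partition hash partitioner; `right` has none.
val left  = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(new HashPartitioner(100))
val right = sc.parallelize(Seq((1, "x"), (3, "y")))

// join() reuses left's partitioner, so only right needs to be shuffled:
println(left.join(right).partitioner)   // Some(HashPartitioner), 100 partitions
```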
Example: PageRank
As an example of a more involved algorithm that can benefit from RDD partitioning,
we consider PageRank. The PageRank algorithm, named after Google's Larry Page,
aims to assign a measure of importance (a “rank”) to each document in a set based on
how many documents have links to it. It can be used to rank web pages, of course,
but also scientific articles, or influential users in a social network.
PageRank is an iterative algorithm that performs many joins, so it is a good use case for RDD partitioning. The algorithm maintains two datasets: one of (pageID, linkList) elements containing the list of neighbors of each page, and one of (pageID, rank) elements containing the current rank for each page. It proceeds as follows:
1. Initialize each page's rank to 1.0.
2. On each iteration, have page p send a contribution of rank(p) / numNeighbors(p)
to its neighbors (the pages it has links to).
3. Set each page's rank to 0.15 + 0.85 * contributionsReceived.
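Translated into Spark, these steps might look like the following Scala sketch. It assumes the neighbor lists were previously saved as a Spark objectFile of (String, Seq[String]) pairs at a hypothetical path "links"; note how partitionBy() on the link list and mapValues() for the ranks keep the two datasets co-partitioned across the repeated join():

```scala
import org.apache.spark.HashPartitioner

// Hash-partition the (pageID, linkList) dataset once and cache it,
// since it is reused on every iteration.
val links = sc.objectFile[(String, Seq[String])]("links")   // "links" is a hypothetical path
              .partitionBy(new HashPartitioner(100))
              .persist()

// Step 1: initialize each page's rank to 1.0. Using mapValues() means
// ranks keeps the same partitioner as links.
var ranks = links.mapValues(_ => 1.0)

// Run 10 iterations of PageRank.
for (i <- 0 until 10) {
  // Step 2: each page sends rank(p) / numNeighbors(p) to its neighbors.
  val contributions = links.join(ranks).flatMap {
    case (pageId, (pageLinks, rank)) =>
      pageLinks.map(dest => (dest, rank / pageLinks.size))
  }
  // Step 3: new rank = 0.15 + 0.85 * sum of received contributions.
  ranks = contributions.reduceByKey(_ + _).mapValues(v => 0.15 + 0.85 * v)
}

ranks.saveAsTextFile("ranks")   // hypothetical output path
```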