considerable network traffic over a simple implementation of PageRank (e.g., in
plain MapReduce).
2. For the same reason, we call persist() on links to keep it in RAM across iterations.
3. When we first create ranks, we use mapValues() instead of map() to preserve the partitioning of the parent RDD (links), so that our first join against it is cheap.
4. In the loop body, we follow our reduceByKey() with mapValues(); because the result of reduceByKey() is already hash-partitioned, this will make it more efficient to join the mapped result against links on the next iteration.
To maximize the potential for partitioning-related optimizations, you should use mapValues() or flatMapValues() whenever you are not changing an element's key.
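The reason key-preserving transformations keep partitioning intact is that a record's partition is a function of its key alone. A minimal plain-Python sketch (no Spark required; the helper name is illustrative) of this idea:

```python
# Sketch of why mapValues()-style transformations preserve partitioning:
# a record's partition depends only on its key, so changing just the
# value cannot move the record to a different partition.

def partition_of(key, num_partitions):
    # Mimics hash partitioning: hash the key, take it modulo the count.
    return hash(key) % num_partitions

num_partitions = 4
links = [("http://a.com", ["http://b.com"]),
         ("http://b.com", ["http://a.com"])]

before = [partition_of(k, num_partitions) for k, _ in links]

# A mapValues()-style step replaces each value but leaves keys alone,
# so every record hashes to exactly the partition it was in before.
ranks = [(k, 1.0) for k, _ in links]
after = [partition_of(k, num_partitions) for k, _ in ranks]

print(before == after)  # prints True: keys unchanged, partitions unchanged
```

A general map(), by contrast, may emit new keys, so Spark cannot assume the output is still partitioned the same way.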
Custom Partitioners
While Spark's HashPartitioner and RangePartitioner are well suited to many use cases, Spark also allows you to tune how an RDD is partitioned by providing a custom Partitioner object. This can help you further reduce communication by taking advantage of domain-specific knowledge.
For example, suppose we wanted to run the PageRank algorithm in the previous section on a set of web pages. Here each page's ID (the key in our RDD) will be its URL. Using a simple hash function to do the partitioning, pages with similar URLs (e.g., http://www.cnn.com/WORLD and http://www.cnn.com/US) might be hashed to completely different nodes. However, we know that web pages within the same domain tend to link to each other a lot. Because PageRank needs to send a message from each page to each of its neighbors on each iteration, it helps to group these pages into the same partition. We can do this with a custom Partitioner that looks at just the domain name instead of the whole URL.
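The core of such a partitioner is simply hashing the domain rather than the full URL. A hedged plain-Python sketch of that logic (in Spark this would live inside a custom Partitioner's getPartition() method; the function name here is illustrative):

```python
# Sketch: partition by domain name instead of full URL, so pages from
# the same site land in the same partition. Uses only the standard
# library; in Spark proper this logic belongs in a custom Partitioner.
from urllib.parse import urlparse

def domain_partition(url, num_partitions):
    # Extract just the host (e.g. "www.cnn.com") and hash that, ignoring
    # the path, so all pages within one domain share a partition.
    domain = urlparse(url).netloc
    return hash(domain) % num_partitions

p1 = domain_partition("http://www.cnn.com/WORLD", 8)
p2 = domain_partition("http://www.cnn.com/US", 8)
print(p1 == p2)  # prints True: same domain, same partition
```

Because both URLs share the host www.cnn.com, they hash identically, so PageRank's per-iteration messages between them stay local to one partition instead of crossing the network.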