considerable network traffic over a simple implementation of PageRank (e.g., in
plain MapReduce).
2. For the same reason, we call persist() on links to keep it in RAM across
iterations.
3. When we first create ranks, we use mapValues() instead of map() to preserve the
partitioning of the parent RDD (links), so that our first join against it is cheap.
4. In the loop body, we follow our reduceByKey() with mapValues(); because the
result of reduceByKey() is already hash-partitioned, this will make it more
efficient to join the mapped result against links on the next iteration.
To maximize the potential for partitioning-related optimizations,
you should use mapValues() or flatMapValues() whenever you
are not changing an element's key.
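The reason mapValues() can safely preserve partitioning is that a record's partition is determined only by its key, so a transformation that provably leaves keys untouched leaves every record where it already is. The effect can be sketched outside Spark with a plain-Python simulation of hash partitioning (this is not the Spark API; the names and the toy update formula are illustrative):

```python
# Plain-Python sketch: a key's partition depends only on the key,
# so a value-only transformation (like RDD.mapValues) cannot move records.

NUM_PARTITIONS = 4

def partition_of(key, num_partitions=NUM_PARTITIONS):
    # In spirit, this is what hash partitioning does: partition is a
    # pure function of the key.
    return hash(key) % num_partitions

pairs = [("pageA", 1.0), ("pageB", 1.0), ("pageC", 1.0)]
before = {k: partition_of(k) for k, _ in pairs}

# Transform only the values (an arbitrary toy update), keys untouched.
mapped = [(k, v * 0.85 + 0.15) for k, v in pairs]
after = {k: partition_of(k) for k, _ in mapped}

assert before == after  # keys unchanged => partition placement unchanged
```

With map(), by contrast, Spark cannot know that the keys are unchanged, so it must conservatively drop the partitioner on the result.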
Custom Partitioners
While Spark's HashPartitioner and RangePartitioner are well suited to many use
cases, Spark also allows you to tune how an RDD is partitioned by providing a
custom Partitioner object. This can help you further reduce communication by
taking advantage of domain-specific knowledge.
For example, suppose we wanted to run the PageRank algorithm in the previous
section on a set of web pages. Here each page's ID (the key in our RDD) will be its
URL. Using a simple hash function to do the partitioning, pages with similar URLs
(e.g., http://www.cnn.com/WORLD and http://www.cnn.com/US) might be hashed to
completely different nodes. However, we know that web pages within the same domain
tend to link to each other a lot. Because PageRank needs to send a message from each
page to each of its neighbors on each iteration, it helps to group these pages into the
same partition. We can do this with a custom Partitioner that looks at just the
domain name instead of the whole URL.
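The core of such a partitioner is its partition-assignment logic: hash only the host portion of the URL, so that every page in a domain maps to the same partition. That logic can be illustrated in plain Python (a hypothetical helper, standing in for what a custom Spark Partitioner's getPartition method would compute; it is not Spark code):

```python
from urllib.parse import urlparse

def domain_partition(url, num_partitions):
    # Hash only the host, not the full URL, so all pages within one
    # domain land in the same partition. Hypothetical sketch of the
    # logic a custom Partitioner's getPartition would use.
    host = urlparse(url).netloc
    return hash(host) % num_partitions

# Both URLs share the host www.cnn.com, so they get the same partition,
# even though plain URL hashing could scatter them.
p1 = domain_partition("http://www.cnn.com/WORLD", 8)
p2 = domain_partition("http://www.cnn.com/US", 8)
assert p1 == p2
```

In real Spark code, this logic would live inside a subclass of Spark's Partitioner, which must also report its number of partitions and define equality so Spark can tell when two RDDs are partitioned the same way.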