considerable network traffic over a simple implementation of PageRank (e.g., in
plain MapReduce).
2. For the same reason, we call persist() on links to keep it in RAM across iterations.
3. When we first create ranks, we use mapValues() instead of map() to preserve the partitioning of the parent RDD (links), so that our first join against it is cheap.
4. In the loop body, we follow our reduceByKey() with mapValues(); because the result of reduceByKey() is already hash-partitioned, this will make it more efficient to join the mapped result against links on the next iteration.
To maximize the potential for partitioning-related optimizations, you should use mapValues() or flatMapValues() whenever you are not changing an element's key.
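The reason key-preserving transformations keep partitioning intact is that a record's partition is a function of its key alone. A minimal plain-Python sketch (no Spark required; the helper name is illustrative) of this idea:

```python
# Sketch of why mapValues()-style transformations preserve partitioning:
# a record's partition depends only on its key, so changing just the
# value cannot move the record to a different partition.

def partition_of(key, num_partitions):
    # Mimics hash partitioning: hash the key, take it modulo the count.
    return hash(key) % num_partitions

num_partitions = 4
links = [("http://a.com", ["http://b.com"]),
         ("http://b.com", ["http://a.com"])]

before = [partition_of(k, num_partitions) for k, _ in links]

# A mapValues()-style step replaces each value but leaves keys alone,
# so every record hashes to exactly the partition it was in before.
ranks = [(k, 1.0) for k, _ in links]
after = [partition_of(k, num_partitions) for k, _ in ranks]

print(before == after)  # prints True: keys unchanged, partitions unchanged
```

A general map(), by contrast, may emit new keys, so Spark cannot assume the output is still partitioned the same way.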
Custom Partitioners
While Spark's HashPartitioner and RangePartitioner are well suited to many use cases, Spark also allows you to tune how an RDD is partitioned by providing a custom Partitioner object. This can help you further reduce communication by taking advantage of domain-specific knowledge.
For example, suppose we wanted to run the PageRank algorithm in the previous section on a set of web pages. Here each page's ID (the key in our RDD) will be its URL. Using a simple hash function to do the partitioning, pages with similar URLs (e.g., http://www.cnn.com/WORLD and http://www.cnn.com/US) might be hashed to completely different nodes. However, we know that web pages within the same domain tend to link to each other a lot. Because PageRank needs to send a message from each page to each of its neighbors on each iteration, it helps to group these pages into the same partition. We can do this with a custom Partitioner that looks at just the domain name instead of the whole URL.
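The core of such a partitioner is simply hashing the domain rather than the full URL. A hedged plain-Python sketch of that logic (in Spark this would live inside a custom Partitioner's getPartition() method; the function name here is illustrative):

```python
# Sketch: partition by domain name instead of full URL, so pages from
# the same site land in the same partition. Uses only the standard
# library; in Spark proper this logic belongs in a custom Partitioner.
from urllib.parse import urlparse

def domain_partition(url, num_partitions):
    # Extract just the host (e.g. "www.cnn.com") and hash that, ignoring
    # the path, so all pages within one domain share a partition.
    domain = urlparse(url).netloc
    return hash(domain) % num_partitions

p1 = domain_partition("http://www.cnn.com/WORLD", 8)
p2 = domain_partition("http://www.cnn.com/US", 8)
print(p1 == p2)  # prints True: same domain, same partition
```

Because both URLs share the host www.cnn.com, they hash identically, so PageRank's per-iteration messages between them stay local to one partition instead of crossing the network.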