The last two steps repeat for several iterations, during which the algorithm will converge to the correct PageRank value for each page. In practice, it's typical to run about 10 iterations.
Example 4-25 gives the code to implement PageRank in Spark.

Example 4-25. Scala PageRank
// Assume that our neighbor list was saved as a Spark objectFile
val links = sc.objectFile[(String, Seq[String])]("links")
             .partitionBy(new HashPartitioner(100))
             .persist()

// Initialize each page's rank to 1.0; since we use mapValues, the resulting RDD
// will have the same partitioner as links
var ranks = links.mapValues(v => 1.0)

// Run 10 iterations of PageRank
for (i <- 0 until 10) {
  val contributions = links.join(ranks).flatMap {
    case (pageId, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  ranks = contributions.reduceByKey((x, y) => x + y).mapValues(v => 0.15 + 0.85 * v)
}

// Write out the final ranks
ranks.saveAsTextFile("ranks")
That's it! The algorithm starts with a ranks RDD initialized at 1.0 for each element, and keeps updating the ranks variable on each iteration. The body of PageRank is pretty simple to express in Spark: it first does a join() between the current ranks RDD and the static links one, in order to obtain the link list and rank for each page ID together, then uses this in a flatMap to create "contribution" values to send to each of the page's neighbors. We then add up these values by page ID (i.e., by the page receiving the contribution) and set that page's rank to 0.15 + 0.85 * contributionsReceived.
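To make the contribution arithmetic concrete outside of a cluster, here is a minimal, Spark-free Scala sketch that applies the same update rule to a small hypothetical three-page graph. The `step` helper and the graph itself are illustrative assumptions, not part of the book's example; the `flatMap`, `groupBy`, and `map` stages mirror the `flatMap`, `reduceByKey`, and `mapValues` stages of the Spark version.

```scala
object PageRankSketch {
  // One PageRank iteration over an in-memory graph, mirroring the
  // Spark pipeline: emit (dest, rank / outDegree) contributions,
  // sum them per page, then apply the 0.15 + 0.85 * sum damping.
  def step(links: Map[String, Seq[String]],
           ranks: Map[String, Double]): Map[String, Double] = {
    // Each page sends an equal share of its rank to every neighbor
    val contributions = links.toSeq.flatMap { case (page, dests) =>
      dests.map(dest => (dest, ranks(page) / dests.size))
    }
    // Sum contributions by receiving page and apply the damping formula
    contributions
      .groupBy { case (dest, _) => dest }
      .map { case (page, cs) => page -> (0.15 + 0.85 * cs.map(_._2).sum) }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical graph: a links to b and c; b and c link back to a
    val links = Map("a" -> Seq("b", "c"), "b" -> Seq("a"), "c" -> Seq("a"))
    var ranks: Map[String, Double] = links.map { case (k, _) => k -> 1.0 }
    for (_ <- 0 until 10) ranks = step(links, ranks)
    println(ranks)
  }
}
```

After a few iterations the ranks settle: page a, which receives links from both other pages, ends up with the highest rank, while b and c (symmetric in this graph) share the same lower value.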
Although the code itself is simple, the example does several things to ensure that the RDDs are partitioned in an efficient way, and to minimize communication:

1. Notice that the links RDD is joined against ranks on each iteration. Since links is a static dataset, we partition it at the start with partitionBy(), so that it does not need to be shuffled across the network. In practice, the links RDD is also likely to be much larger in terms of bytes than ranks, since it contains a list of neighbors for each page ID instead of just a Double, so this optimization saves