The last two steps repeat for several iterations, during which the algorithm will converge to the correct PageRank value for each page. In practice, it's typical to run about 10 iterations.
Example 4-25 gives the code to implement PageRank in Spark.
Example 4-25. Scala PageRank
// Assume that our neighbor list was saved as a Spark objectFile
val links = sc.objectFile[(String, Seq[String])]("links")
  .partitionBy(new HashPartitioner(100))
  .persist()

// Initialize each page's rank to 1.0; since we use mapValues, the resulting RDD
// will have the same partitioner as links
var ranks = links.mapValues(v => 1.0)

// Run 10 iterations of PageRank
for (i <- 0 until 10) {
  val contributions = links.join(ranks).flatMap {
    case (pageId, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  ranks = contributions.reduceByKey((x, y) => x + y).mapValues(v => 0.15 + 0.85 * v)
}

// Write out the final ranks
ranks.saveAsTextFile("ranks")
That's it! The algorithm starts with a ranks RDD initialized at 1.0 for each element, and keeps updating the ranks variable on each iteration. The body of PageRank is pretty simple to express in Spark: it first does a join() between the current ranks RDD and the static links one, in order to obtain the link list and rank for each page ID together, then uses this in a flatMap to create "contribution" values to send to each of the page's neighbors. We then add up these values by page ID (i.e., by the page receiving the contribution) and set that page's rank to 0.15 + 0.85 * contributionsReceived.
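The same update rule can be seen in isolation with plain Scala, without Spark. The following is a minimal sketch on a made-up three-page graph (the page names and link structure are illustrative, not from the example above); it applies exactly the contribution-and-damping formula the text describes:

```scala
// Toy graph: each page maps to its list of outgoing links (illustrative data)
val links = Map(
  "a" -> Seq("b", "c"),
  "b" -> Seq("c"),
  "c" -> Seq("a")
)

// Start every page at rank 1.0, as in the Spark version
var ranks: Map[String, Double] = links.map { case (page, _) => page -> 1.0 }

for (_ <- 0 until 10) {
  // Each page sends rank / outDegree to each of its neighbors
  val contributions = links.toSeq.flatMap { case (page, dests) =>
    dests.map(dest => (dest, ranks(page) / dests.size))
  }
  // Sum the contributions each page receives, then apply the damping formula
  ranks = links.map { case (page, _) =>
    val received = contributions.collect { case (p, c) if p == page => c }.sum
    page -> (0.15 + 0.85 * received)
  }
}
```

Because every page here has at least one outgoing link, the total rank is conserved across iterations: summing the update formula over all pages gives 0.45 + 0.85 * (total rank), which stays fixed at 3.0 for three pages.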
Although the code itself is simple, the example does several things to ensure that the
RDDs are partitioned in an efficient way, and to minimize communication:
1. Notice that the links RDD is joined against ranks on each iteration. Since links
is a static dataset, we partition it at the start with partitionBy() , so that it does
not need to be shuffled across the network. In practice, the links RDD is also
likely to be much larger in terms of bytes than ranks , since it contains a list of
neighbors for each page ID instead of just a Double, so this optimization saves considerable network traffic.
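To see why co-partitioning avoids a shuffle, it helps to know how a HashPartitioner decides where a key lives: it takes the key's hashCode modulo the number of partitions (adjusted to be non-negative). The sketch below is a simplified stand-in for that logic, not Spark's actual class; because the assignment depends only on the key and the partition count, links and ranks partitioned the same way place matching page IDs in the same partition, and join() can proceed without moving data:

```scala
// Simplified sketch of hash-partitioner key assignment (illustrative, not
// Spark's implementation): partition = key.hashCode mod numPartitions,
// shifted into the non-negative range.
def partitionOf(key: Any, numPartitions: Int): Int = {
  val raw = key.hashCode % numPartitions
  if (raw < 0) raw + numPartitions else raw
}
```

Since partitionOf is deterministic, the same page ID always lands in the same partition of both RDDs, which is what lets the join run partition-locally.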