The last two steps repeat for several iterations, during which the algorithm will converge to the correct PageRank value for each page. In practice, it's typical to run about 10 iterations.
Example 4-25 gives the code to implement PageRank in Spark.

Example 4-25. Scala PageRank
// Assume that our neighbor list was saved as a Spark objectFile
val links = sc.objectFile[(String, Seq[String])]("links")
             .partitionBy(new HashPartitioner(100))
             .persist()

// Initialize each page's rank to 1.0; since we use mapValues, the resulting RDD
// will have the same partitioner as links
var ranks = links.mapValues(v => 1.0)

// Run 10 iterations of PageRank
for (i <- 0 until 10) {
  val contributions = links.join(ranks).flatMap {
    case (pageId, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  ranks = contributions.reduceByKey((x, y) => x + y).mapValues(v => 0.15 + 0.85 * v)
}

// Write out the final ranks
ranks.saveAsTextFile("ranks")
That's it! The algorithm starts with a ranks RDD initialized at 1.0 for each element, and keeps updating the ranks variable on each iteration. The body of PageRank is pretty simple to express in Spark: it first does a join() between the current ranks RDD and the static links one, in order to obtain the link list and rank for each page ID together, then uses this in a flatMap to create "contribution" values to send to each of the page's neighbors. We then add up these values by page ID (i.e., by the page receiving the contribution) and set that page's rank to 0.15 + 0.85 * contributionsReceived.
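To make the contribution arithmetic concrete outside of a cluster, here is a minimal, Spark-free Scala sketch that applies the same update rule to a small hypothetical three-page graph. The `step` helper and the graph itself are illustrative assumptions, not part of the book's example; the `flatMap`, `groupBy`, and `map` stages mirror the `flatMap`, `reduceByKey`, and `mapValues` stages of the Spark version.

```scala
object PageRankSketch {
  // One PageRank iteration over an in-memory graph, mirroring the
  // Spark pipeline: emit (dest, rank / outDegree) contributions,
  // sum them per page, then apply the 0.15 + 0.85 * sum damping.
  def step(links: Map[String, Seq[String]],
           ranks: Map[String, Double]): Map[String, Double] = {
    // Each page sends an equal share of its rank to every neighbor
    val contributions = links.toSeq.flatMap { case (page, dests) =>
      dests.map(dest => (dest, ranks(page) / dests.size))
    }
    // Sum contributions by receiving page and apply the damping formula
    contributions
      .groupBy { case (dest, _) => dest }
      .map { case (page, cs) => page -> (0.15 + 0.85 * cs.map(_._2).sum) }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical graph: a links to b and c; b and c link back to a
    val links = Map("a" -> Seq("b", "c"), "b" -> Seq("a"), "c" -> Seq("a"))
    var ranks: Map[String, Double] = links.map { case (k, _) => k -> 1.0 }
    for (_ <- 0 until 10) ranks = step(links, ranks)
    println(ranks)
  }
}
```

After a few iterations the ranks settle: page a, which receives links from both other pages, ends up with the highest rank, while b and c (symmetric in this graph) share the same lower value.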
Although the code itself is simple, the example does several things to ensure that the RDDs are partitioned in an efficient way, and to minimize communication:

1. Notice that the links RDD is joined against ranks on each iteration. Since links is a static dataset, we partition it at the start with partitionBy(), so that it does not need to be shuffled across the network. In practice, the links RDD is also likely to be much larger in terms of bytes than ranks, since it contains a list of neighbors for each page ID instead of just a Double, so this optimization saves