Database Reference
In-Depth Information
Therefore, we instead propose the following heuristic for finding the center node: we
choose the center vertex by sampling from the high-degree vertices. This heuristic
is based on the fact that vertices with large degree have small radii [28]. After find-
ing a center node, we need to renumber the edge file to swap the current minimum
vertex id with the center vertex id. The MapReduce algorithm for this renumbering
is shown in Algorithm 8.6. Since the renumbering requires only filtering, it can be
done with a map-only job.
8.4.6 a nalysis
Finally, we analyze the time and space complexity of GIM-V . It is not hard to observe
that one iteration of GIM-V takes O
nm
M
+
nm
M
+
log
time, where M stands for the
number of machines. Assuming uniformity, mappers and reducers of Stage1 and
Stage2 receive O
nm
M
+
records per machine. The running time is dominated by
the sorting time for nm
M
+
records. GIM-V requires O( n + m ) space.
8.5 SCALABILITY
We perform experiments to answer the following questions:
How does GIM-V scale up?
Which of the proposed optimizations (block multiplication, clustered edges,
and diagonal block iteration, vertex renumbering) gives the highest perfor-
mance gains?
The graphs we use in our experiments are shown in Table 8.2. We run PEGASUS
in M45 Hadoop cluster by Yahoo! and our own cluster composed of 9 machines. M45
is one of the top 50 supercomputers in the world with the total 1.5-PB storage and
3.5-TB memory. For the performance and scalability experiments, we used synthetic
Kronecker graphs [31] since we can generate them with any size, and they are one of
the most realistic graphs among synthetic graphs.
8.5.1 r esults
We first show how the performance of our method changes as we add more machines.
Figure 8.5 shows the running time and performance of GIM-V for PageRank with
Kronecker graph of 282 million edges and size 32 blocks if necessary.
In Figure 8.5a, for all of the methods the running time decreases as we add more
machines. Note that clustered edges ( GIM-V CL) did not help performance unless it
is combined with block encoding. When it is combined, however, it showed the best
performance ( GIM-V BL- CL).
In Figure 8.5b, we see that the relative performance of each method compared
with GIM-V BASE method decreases as number of machines increases. With three
Search WWH ::




Custom Search